FactGrid talk:Subscription lists

Revision as of 16:08, 31 July 2024 by Olaf Simons (talk | contribs)

British Music Subscriptions

The present list has 156,536 lines, each a subscription, usually for one copy, sometimes for more. We could create one item per subscription and then some 100,000 items for the individual subscribers who can be assumed as agents behind the subscriptions. The tentative isolation of these agents is necessary to make more definite statements about the customers. How many one-time customers do we have in this set? How many addicted music fans are there? How many men, how many women? How does the audience change in the course of the century? What are the statistics for professions?

In the end we would have 156,536 subscriptions connected to maybe 90,000 customers: a mass of items with almost identical amounts of information on each of the corresponding items.

It is therefore preferable to identify customers before the input and to focus on the customers in all ensuing work. The subscriptions then simply become information on the individual customers. Some customers are famous, like the members of the royal family. Other customers are generic placeholders with statements like "A Lady".

The identification of big shots like the King in a given year or George Frederick Handel, or of smaller shots like university members at Oxford and Cambridge or booksellers in various cities, will later be work for researchers.

A first identification should stay within the set. We should try to understand who appears in these 156,536 subscriptions again and again. Here we need a tentative internal identification procedure under a set of rules. Someone with the same name, place and profession is the same customer. In other cases we may assume with a high degree of certainty that this is a customer whom we already noted in a previous subscription. No customer will appear twice in the same subscription (any such customer would be listed once with several "sets" ordered).
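
The exact-match rule and the one-entry-per-subscription constraint could be sketched roughly as follows. This is only a minimal illustration: the dictionary keys (`name`, `place`, `profession`, `publication`) and the helper names are assumptions rather than columns of the actual spreadsheet, and "subscription" is read here as the subscriber list of one publication.

```python
from collections import defaultdict


def normalise(value):
    """Lower-case and collapse whitespace; missing values become ''."""
    return " ".join((value or "").lower().split())


def group_exact_matches(lines):
    """Group lines that share name, place and profession.

    Lines from the same publication are never placed in the same group,
    since no customer should appear twice in one subscription list.
    Missing values compare as equal here, which is a simplification.
    """
    buckets = defaultdict(list)  # key -> list of groups sharing that key
    for line in lines:
        key = tuple(normalise(line.get(f)) for f in ("name", "place", "profession"))
        for group in buckets[key]:
            # Join the first group that has no line from this publication yet.
            if all(m["publication"] != line["publication"] for m in group):
                group.append(line)
                break
        else:
            buckets[key].append([line])
    return [group for groups in buckets.values() for group in groups]
```

Each returned group is a provisional customer. Anything beyond this exact match would need graded rules and, in doubtful cases, a human decision.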

Segmentation

The information of each subscription can be split under the following headers:

  • Family name
  • Given name
  • Profession
  • Status (Mr., Miss, Esq., King, Hon[oura]ble., etc.)
  • Gender (to be inferred from status and given name; FactGrid has a name list with gender attributions)
  • Organisational context (shop, association, club, choir)
  • Place of address
  • Number of sets ordered
  • Payment
      • amount
      • unit
  • Publication (col. E)
  • Date of publication (col. F)
  • Place of publication (col. G)
  • Publisher (col. H)

Information is rarely given under all headings, and individual statements (and abbreviations) can vary.
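
As a rough sketch of how these headers might be held in code, the record below makes every field optional, since a line rarely fills all of them. The field names, the `STATUS_GENDER` table and the `name_list` dictionary (a stand-in for FactGrid's name list) are illustrative assumptions, not an existing schema.

```python
from dataclasses import dataclass, fields
from typing import Optional

# Illustrative status-to-gender hints; an assumption, not a FactGrid vocabulary.
STATUS_GENDER = {"mr.": "male", "esq.": "male", "king": "male",
                 "miss": "female", "mrs.": "female", "lady": "female"}


@dataclass
class SubscriptionLine:
    """One line of the list; all headers are optional."""
    family_name: Optional[str] = None
    given_name: Optional[str] = None
    profession: Optional[str] = None
    status: Optional[str] = None
    gender: Optional[str] = None
    organisation: Optional[str] = None
    place: Optional[str] = None
    sets_ordered: Optional[int] = None
    payment_amount: Optional[float] = None
    payment_unit: Optional[str] = None
    publication: Optional[str] = None
    publication_date: Optional[str] = None
    publication_place: Optional[str] = None
    publisher: Optional[str] = None

    def known_headers(self):
        """Names of the headers that actually carry a value."""
        return [f.name for f in fields(self) if getattr(self, f.name) is not None]


def infer_gender(status, given_name, name_list):
    """Guess gender from status first, then from a given-name lookup.

    `name_list` is a plain dict standing in for FactGrid's name list.
    Returns None when neither source helps.
    """
    hint = STATUS_GENDER.get((status or "").strip().lower())
    if hint:
        return hint
    return name_list.get((given_name or "").strip().lower())
```

Varying abbreviations would have to be normalised before the lookup. The sketch only lower-cases and trims the value.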

Subscriptions that provide enough information to allow a provisional/actual identification

Most of the subscriptions come with identifiers that did the job back in the 18th century: a family name, gender (Miss, Mr., Seignor...), a given name, a place name, a profession. Some of the subscribers are organisations or companies (choirs, booksellers). Most of them are individuals in the full range from "The King" and "George Frederic Handel, Esq;" to "Lady Brown" (Y3536) who might (or might never) become identifiable in the hands of specialists.

It would be good to aggregate information before we feed the data into FactGrid. Creating individual items in later work will be painful; merging thousands of items will be equally painful. The ideal data set is already an interpretation of entities that should be kept apart, but these should come with transparent indications of potential "Merger candidates" or with warnings that a separation of data might be necessary.

FactGrid knows 18th-century composers (noted in Wikidata), but the British prosopographic dataset is more or less empty, so that it remains easy to match proposals against the present database.

The interesting matching process will focus on identities within the set, since up to 50,000 music lovers in this set ordered several titles.

The optimal solution is a plausible proposal of the identities to create under rules such as:

  • 100% recommendation - no matches within the dataset
  • 90% recommendation - if names, professions and places are the same within three decades
  • 80% recommendation - if full names and places are the same within three decades
  • 75% recommendation - if last names and places are the same within three decades
  • 70% recommendation - if status, name and place are matches
  • 65% recommendation - if full names and professions are the same within three decades (though places differ)
  • 30%-65% recommendation - "manual" look requested
  • < 30% - simple item creation is proposed, though with identification of possible matches
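
The graded rules above could be tried out with a pairwise scoring function along the following lines. This is a sketch under stated assumptions: records are plain dicts with hypothetical keys (`family_name`, `given_name`, `profession`, `place`, `status`, `year`, `publication`), "three decades" is read as a 30-year window, and "name" in the 70% rule is read as the family name. The bands below 65% are not tied to explicit criteria in the list and are therefore not modelled.

```python
def normalise(value):
    """Lower-case and collapse whitespace; missing values become ''."""
    return " ".join((value or "").lower().split())


def same(a, b, *keys):
    """True if every key has a value in both records and the values agree."""
    return all(normalise(a.get(k)) and normalise(a.get(k)) == normalise(b.get(k))
               for k in keys)


def match_score(a, b, window_years=30):
    """Recommendation (in percent) that a and b are one customer, or None."""
    pub = a.get("publication")
    if pub is not None and pub == b.get("publication"):
        return None  # no customer appears twice in the same subscription list
    close = abs(a["year"] - b["year"]) <= window_years
    full_name = same(a, b, "family_name", "given_name")
    if close and full_name and same(a, b, "profession", "place"):
        return 90
    if close and full_name and same(a, b, "place"):
        return 80
    if close and same(a, b, "family_name", "place"):
        return 75
    if same(a, b, "status", "family_name", "place"):
        return 70
    if close and full_name and same(a, b, "profession"):
        return 65
    return None


def best_match(record, others):
    """Highest-scoring candidate as (score, record), or (None, None)."""
    scored = [(match_score(record, o), o) for o in others]
    scored = [(s, o) for s, o in scored if s is not None]
    return max(scored, key=lambda pair: pair[0], default=(None, None))
```

A record for which `best_match` returns `(None, None)` corresponds to the 100% case: nothing in the dataset matches, so a new item can be proposed without reservation.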

"Manual" scans of proposals would be assisted by subsets that present the data sorted by name, place, status or profession (with the respective dates of publication). One could in this case go through all candidates from York or Bath and decide with human knowledge.