FactGrid talk:Musical Subscriptions

First data input: 770 Titles — authors, publishers, Genres

Dear Martin, dear Simon, the first input has been done, basically with the aim of disentangling the web: you have spotted 156,536 subscriptions on 770 titles. The titles are on FactGrid with dates and places of publication. I have not yet gone into the 490 authors. FactGrid knows more than 9000 composers, but the automatic matching worked on only 90 of your 490. (See this Google Sheet: it has our composers, an automatic matching column and your authors.) Some of your "authors" turn out to be poets rather than composers (they need to be created). Some are anonymous Gentlemen; here you have to decide whether you want to give them database objects or simply leave the authorship question without a statement. Items make sense if you can still add more information on these unknown people.

I will create the 400 missing authors any moment now, with slightly more information if that is available, or just as placeholders if that facilitates things --Olaf Simons (talk) 13:22, 23 October 2021 (CEST)

Upcoming data input: 156,536 subscriptions

Here is a passage to give an idea of the challenge:

HENRY Avery, Esq; 4 Sets
John Abery, Esq; Reading, Berks
Shuckborough Ashby, Esq; York
Rev. Mr. Awberry, Fellow of New College, Oxon
Mrs. Ashby, 3 Sets
Mrs. Ackland, 2 Sets
Mrs. Ashley
Mr. Atwell
Mr. Asgill
Miss Atkins, 2 Sets
Miss Angel
Mr. Aris
The Right Honourable the Earl of Brooke
Lady Bucke, 2 Sets
Miss Bucke
Anthony Blagrave, Esq; Southcot, Berks
— Brougham, Esq;
Thomas Baker, Esq; Farnham, Surrey
Rev. Mr. Bridges, York
Mr. Beard, 2 Sets
Mr. James Bartlet, Holmes Chapel, Cheshire
Mrs. Bance, 4 Sets
Mr. Bruce, 2 Sets
...

40,000 hidden double records?

The number of subscriptions is massive, leading into big data (by 18th-century standards): 770 titles meet 156,536 subscriptions. With the Q-numbers we can take weight off the spreadsheet that is circulating within the project. Each line is one subscription. The Q-number in Col. D states the object (the title) of each subscription. Col. A is a running number to preserve the original sequence.

156,536 / 770 ≈ 203 subscriptions per title on average.

It is difficult to determine how many different "people" we should create on FactGrid. The present set gives the entries as they are stated in the various lists. We should note each subscription on the person's item with that original statement (Property:P35 is designed for this) and add a Property:P499 series number so that one can recreate any list.
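A minimal sketch of why the series number matters, assuming each subscription is kept as one record with its running number (Col. A), the Q-number of its title (Col. D) and the original statement. The class name, the field names and the `Q_...` identifiers are placeholders invented for illustration, not real FactGrid items; the grouping of the sample entries under two titles is likewise only an example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subscription:
    series_no: int   # running number from Col. A (the value meant for P499)
    title_qid: str   # Q-number of the subscribed title from Col. D
    statement: str   # entry exactly as printed (the value meant for P35)
    person_qid: str  # item the subscription is noted on


def recreate_list(subscriptions, title_qid):
    """Return the original statements of one title in their printed order."""
    rows = [s for s in subscriptions if s.title_qid == title_qid]
    return [s.statement for s in sorted(rows, key=lambda s: s.series_no)]


# Placeholder Q-numbers; deliberately stored out of order.
subscriptions = [
    Subscription(3, "Q_TITLE_A", "Shuckborough Ashby, Esq; York", "Q_PERSON_3"),
    Subscription(1, "Q_TITLE_A", "HENRY Avery, Esq; 4 Sets", "Q_PERSON_1"),
    Subscription(4, "Q_TITLE_B", "Mrs. Ashley", "Q_PERSON_4"),
    Subscription(2, "Q_TITLE_A", "John Abery, Esq; Reading, Berks", "Q_PERSON_2"),
]

restored = recreate_list(subscriptions, "Q_TITLE_A")
```

Because the original statement travels with every record, merging several subscriptions onto one person item later on does not lose the wording of any individual list.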

I played with the data for a day, trying to find out how many double records we might have in this set. The individual lists are without double records, but the entire set has massive numbers of recurring entries, basically in four categories:

  1. people we can spot wherever they appear as they are as famous as the individual members of the royal family
  2. entries that want to make it impossible to spot the individual customer (under statements like "A Gentleman in Oxford").
  3. people where we get the same name and status information without being able to spot a particular person in the foreseeable future
  4. entries that make it difficult to spot double records thanks to the load of variants used by the different publishers ("The Rt. Hon." can also be spelled out as the "Right honourable" or can occur as "The Rt. Honble", some publishers start with the title, others with the family name etc.)

I ran several standardisations (like standardising the "The Rt. Hon.") and eliminated variables (the numbers of copies ordered in each entry). The number of potential double records rose considerably with each step. My present, still only moderately harmonised list brings the 156,536 subscriptions down to about 122,000 unique statements. At least 34,357 say exactly the same as another entry, just with other abbreviations or in another arrangement. We could go for the smallest number, treating all potentially identical entries as the same person, since only this move will reveal the frequent customers within the set. Are the frequent customers simply better off? Are they professional musicians? In this process we will create items like "A lady" that seem to buy a lot because they actually stand for numerous ladies, but we will be able to set a particular statement on them to single them out (a "might be more than one person" statement is available on FactGrid for that purpose).
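The kind of harmonisation described here can be sketched as follows. This is only an illustration under stated assumptions, not the procedure actually used on the spreadsheet: the two regular expressions cover just the "Rt. Hon." variants and the "n Sets" copy counts mentioned above, the names `normalise`, `RT_HON` and `COPIES` are made up, and the "Earl of Brooke" spelling variants are invented to show the effect. Entries that put the family name before the title are not handled.

```python
import re
from collections import Counter

# Spelling variants of "The Rt. Hon." (assumed, far from complete).
RT_HON = re.compile(
    r"\b(?:the\s+)?(?:rt\.?|right)\s+hon(?:ourable|orable|ble)?\.?(?=\W|$)",
    re.IGNORECASE,
)
# Number of copies ordered, e.g. ", 2 Sets" or "; 4 Sets".
COPIES = re.compile(r"[,;]?\s*\d+\s+sets?\b\.?", re.IGNORECASE)


def normalise(entry):
    """Reduce an entry to a comparison key; the original text stays untouched."""
    text = COPIES.sub("", entry)              # eliminate the copies variable
    text = RT_HON.sub("The Rt. Hon.", text)   # unify the title variants
    text = re.sub(r"\s+", " ", text).strip(" ;,.")
    return text.casefold()


entries = [
    "The Right Honourable the Earl of Brooke",
    "The Rt. Honble the Earl of Brooke, 2 Sets",  # invented variant
    "The Rt. Hon. the Earl of Brooke",            # invented variant
    "Mrs. Ashby, 3 Sets",
    "Mrs. Ashby",
    "Mr. Atwell",
]

counts = Counter(normalise(e) for e in entries)
unique_statements = len(counts)
potential_doubles = sum(n - 1 for n in counts.values())
```

Each additional rule can only merge keys, never split them, which is why the count of potential double records grows with every harmonisation step.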

The technical challenge: transform the list of all subscriptions into a full structured data version

As the individual entries provide all the information we can get, and that is not much for most of the entries, we should aim at a complete digest: all the components should be standardised to their fullest respective forms, without any loss of content. The standardisation will create standardised labels and supply the information for the various statements on each item.

The question is how we get entity recognition done with a manageable amount of work.

  • Using spreadsheet filters is doable and might actually not be the worst option. One would create a list to comb through and one would gradually extract information from this list, until it is totally dissolved. A first extraction will bring all the Mr. Mrs. Miss statements into one column (and out of that list), the next move will extract and remove the "right honorable" statements and so on, until we only have the names left - the moment in which we will separate initials and given names from family names. Dissolving the original list into some ten columns of potential components will allow us to bring them into a standardised arrangement for the labels and it will give the statements we can immediately make on each item.
  • Scripting (but how?): commands that go back to the previous line and fetch from there the first word or formula that is replaced by a dash in the current line.
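Both options can be sketched in a few lines of Python. What follows is a rough sketch under assumptions, not a tested pipeline: it reads the dash as a ditto mark for the first word or formula of the previous line, as described above (whether that reading is right for every list would need checking), the list `LEADING_FORMULAS` and all function and column names are invented, and an entry without a number of sets is assumed to order one copy.

```python
import re

# A line opening with a dash (em dash, en dash or hyphen) standing for omitted text.
DITTO = re.compile(r"^\s*[\u2014\u2013-]+\s*")
# Formulas to fetch as a whole rather than word by word (assumed list).
LEADING_FORMULAS = ("The Right Honourable", "The Rt. Hon.", "Rev. Mr.")

HONORIFIC = re.compile(r"^(Rev\. Mr\.|Mr\.|Mrs\.|Miss)\s+")
STATUS = re.compile(r",?\s*\bEsq\b[;.,]?")
COPIES = re.compile(r",?\s*(\d+)\s+Sets?\s*$")


def leading_formula(line):
    """First known formula of a line, else its first word."""
    for formula in LEADING_FORMULAS:
        if line.startswith(formula):
            return formula
    parts = line.split()
    return parts[0] if parts else ""


def fill_dittos(lines):
    """Replace a leading dash with the opening of the previous (filled) line."""
    filled = []
    for line in lines:
        match = DITTO.match(line)
        if match and filled:
            line = f"{leading_formula(filled[-1])} {line[match.end():]}"
        filled.append(line)
    return filled


def split_entry(entry):
    """Peel components off one entry, one column at a time."""
    row = {"original": entry, "honorific": "", "status": "", "copies": 1}
    rest = entry.strip()
    match = COPIES.search(rest)
    if match:                       # "4 Sets" goes into the copies column
        row["copies"] = int(match.group(1))
        rest = rest[:match.start()]
    match = HONORIFIC.match(rest)
    if match:                       # Mr. / Mrs. / Miss / Rev. Mr.
        row["honorific"] = match.group(1)
        rest = rest[match.end():]
    match = STATUS.search(rest)
    if match:                       # "Esq;" goes into the status column
        row["status"] = "Esq"
        rest = rest[:match.start()] + "," + rest[match.end():]
    name, _, remainder = rest.partition(",")
    row["name"] = name.strip(" ;,")
    row["remainder"] = remainder.strip(" ;,")   # place, office etc., still unsorted
    return row


sample = fill_dittos(["Anthony Blagrave, Esq; Southcot, Berks", "\u2014 Brougham, Esq;"])
rows = [split_entry(line) for line in sample]
```

The `remainder` column is left deliberately coarse: places ("Reading, Berks") and offices ("Fellow of New College, Oxon") end up there together and would need further passes, just as with the step-by-step spreadsheet filtering.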

What FactGrid should be good at

FactGrid should be good at offering all these entries to a gradual and collective identification of people in this set. On the one hand we will learn a lot about known people acting here as customers. On the other hand we will get increasingly complex pictures of the respective audiences of these publications: we will be able to give age profiles, sociometric statistics, and ideas of the generations behind the changes in taste that run through the period.

FactGrid allows the creation of items that would not fill an article, and it is open to collective work, inviting identifications over the next years; such, at least, is the software's promise.

To facilitate identifications we should create items with all the information we have as provided by the entries:

  • The Musical Society at the Castle Tavern in Pater-Noster-Row
  • Rev. Mr. Dovey, Prebend of the Cathedral at Lichfield

The unifying step must come first, so that we will aggregate information as far as we can on all the individual items before we send out invitations to identify people in these lists.

It will be interesting to contact important mailing lists and we should try to reach out into the sphere of genealogists who might be able to spot ancestry in these lists. --Olaf Simons (talk) 00:00, 26 October 2021 (CEST)