FactGrid talk:Subscription lists: Difference between revisions

(Created page with "== British Music Subscriptions == The present list has 156,536 lines, each a subscription - usually one copy, sometimes more copies. We could create an item per subscription and then create (almost) as many items for the individual subscribers with almost identical information. The alternative would be a process of entity recognition. It would here be interesting to separate the following groups: # Subscriptions that provide enough information to allow a provisional/act...")
 
 
(5 intermediate revisions by the same user not shown)
Line 1: Line 1:
== British Music Subscriptions ==
== British Music Subscriptions ==
The present list has 156,536 lines, each a subscription - usually one copy, sometimes more copies. We could create an item per subscription and then create (almost) as many items for the individual subscribers with almost identical information. The alternative would be a process of entity recognition. It would here be interesting to separate the following groups:
The present list has 156,536 lines, each a subscription - usually one copy, sometimes more copies. We could create an item per subscription and then create some 100,000 items for the individual subscribers that could be assumed as agents behind the subscriptions. The tentative isolation of these agents would be necessary to make more definite statements about the customers. How many one-time customers do we have in this set? How many addicted music fans are there? How many men, how many women? How does the audience change in the course of the century? What are the statistics for professions?


# Subscriptions that provide enough information to allow a provisional/actual identification
In the end we would have 156,536 subscriptions connected to maybe 90,000 customers - a mass of items with almost identical amounts of information on each of the corresponding items.
# Subscriptions that are essentially generic: Y96894: A Lady.
 
It is therefore preferable to take the step towards an identification of customers before the input and to then focus on the customers in all ensuing work. The subscriptions will now just become information on the individual customers. Some customers are famous like the members of the royal family. Other customers are functional generic shrouds with statements like "A Lady".
 
The identification of big shots like the King in a given year or like George Frederick Handel, or of smaller shots like university members at Oxford and Cambridge or booksellers in various cities, will later be the work of researchers.
 
A first identification should stay within the set. We should try to understand who appears in these 156,536 subscriptions again and again. Here we need a tentative internal identification procedure under a set of rules. Entries with the same name, place and profession refer to the same customer. In other cases we may assume with a high degree of certainty that this is a customer whom we already noted in a previous subscription. No customer will appear twice in the same subscription (any such customer would be listed once with several "sets" ordered).
 
* Step 1 would be a segmentation of all the information
* Step 2 would be the tentative creation of the full set of individual customers, isolated internally as probable individual agents.
 
=== Segmentation ===
The information of each subscription can be split under the following headers:
* Family name
* Given name
* Profession
* Status (Mr., Miss, Esq., King, Hon[oura]ble., etc.)
* Gender (to be inferred from status and given name; FactGrid has a name list with gender attributions)
* Organisation (choir, shop, club etc.)
* Organisation noted with individual customers
* Customer's place of address
* Number of sets ordered
* Payment
:* amount
:* unit
* Publication (col. E)
* Date of publication (col. F)
* Place of publication (col. G)
* Publisher (col. H)
 
The publications (E) are already identified and have Q-numbers leading to the background details, I can provide (F-H).


=== Subscriptions that provide enough information to allow a provisional/actual identification ===
=== Subscriptions that provide enough information to allow a provisional/actual identification ===
Most of the subscriptions come with plausible identifiers: A family name, gender (Miss, Mr., Seignor...) a given name, a place name, a profession. Some of the subscribers are organisations or companies (choirs, booksellers), most of them are individuals in the full range from "The King" and "George Frederic Handel, Esq;" to "Lady Brown" (Y3536) who might (or might never ever) become identifiable in the hands of specialists.
Most of the subscriptions come with internal identifiers that made sense back when the individual orders were registered: the subscribers wanted to be known as connoisseurs and supporters, which is why they agreed to see their names entered in a register introducing the respective publication.
 
It would be good to accumulate and to sort the information before we feed the data into FactGrid. Creating items one by one is painful. Merging items - thousands of items manually - is even more painful. The ideal data set is already an interpretation, though a transparent interpretation with indications of potential "Merger candidates" or with warnings that a separation of data might be necessary.


FactGrid knows 18th-century composers (noted in Wikidata) but the British prosopographic dataset is more or less empty.
It is hence possible to individualise a high percentage of the subscribers with the help of the identifiers stated.


The interesting matching process will focus on identities that should be created among the 156,536 subscriptions.
We could generate procedural recommendations once the split of information is done:
The optimal solution is, however, a plausible and transparent set of identities to create under rules such as:


* 90% match - if names, professions and places are the same within three decades
* 100% recommendation to create the person/organisation if it appears only once in the data set under the given name.
* 80% match - if full names and places are the same within three decades
* 90% recommendation to attribute several subscriptions to the same person/organisation - if name, profession and note of address are identical within a range of three decades
* 75% match - if last names and places are the same within three decades
* 80% recommendation... - if full names and places are the same within a generation
* 70% match - if status, name and place are matches
* 70% recommendation... - if status, name and place are the same within a generation
* 65% match - if full names and professions are the same within three decades (though places differ)
* 65% recommendation... - if full names and professions appear again in a generation
* 30%-65% match - "manual" look requested
* etc.
* < 29% simple item creation is proposed though with identification of possible matches


"Manual" scans of proposals would be assisted by subsets that present the data sorted by name, place, status or profession (with the respective dates of publication). One could in this case go through all candidates from York or Bath and decide with human knowledge.
Lower recommendations could come with a call to "take a look" through potential matches. The previous segmentation will make it easy to create any subset for researchers to run through under questions such as: show me all customers in Glasgow through the century (so that I can spot possible repeated subscribers in this particular set).

Latest revision as of 18:13, 31 July 2024

British Music Subscriptions

The present list has 156,536 lines, each a subscription - usually one copy, sometimes more copies. We could create an item per subscription and then create some 100,000 items for the individual subscribers that could be assumed as agents behind the subscriptions. The tentative isolation of these agents would be necessary to make more definite statements about the customers. How many one-time customers do we have in this set? How many addicted music fans are there? How many men, how many women? How does the audience change in the course of the century? What are the statistics for professions?

In the end we would have 156,536 subscriptions connected to maybe 90,000 customers - a mass of items with almost identical amounts of information on each of the corresponding items.

It is therefore preferable to take the step towards an identification of customers before the input and to then focus on the customers in all ensuing work. The subscriptions will now just become information on the individual customers. Some customers are famous like the members of the royal family. Other customers are functional generic shrouds with statements like "A Lady".

The identification of big shots like the King in a given year or like George Frederick Handel, or of smaller shots like university members at Oxford and Cambridge or booksellers in various cities, will later be the work of researchers.

A first identification should stay within the set. We should try to understand who appears in these 156,536 subscriptions again and again. Here we need a tentative internal identification procedure under a set of rules. Entries with the same name, place and profession refer to the same customer. In other cases we may assume with a high degree of certainty that this is a customer whom we already noted in a previous subscription. No customer will appear twice in the same subscription (any such customer would be listed once with several "sets" ordered).

  • Step 1 would be a segmentation of all the information
  • Step 2 would be the tentative creation of the full set of individual customers, isolated internally as probable individual agents.
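
The two steps can be sketched in Python as follows. This is only a minimal sketch under assumptions: the field names (`family_name`, `given_name`, `place`, `profession`, `publication`) and the sample rows are hypothetical, and the exact-match key merely stands in for whatever set of rules is finally agreed. The second function uses the observation that no customer appears twice in the same subscription list to flag groups that probably merge two different people.

```python
from collections import defaultdict

# Hypothetical field names; the real column layout may differ.
KEY_FIELDS = ("family_name", "given_name", "place", "profession")


def tentative_customers(rows):
    """Group segmented rows by identical name, place and profession."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple((row.get(f) or "").strip().lower() for f in KEY_FIELDS)
        groups[key].append(row)
    return groups


def same_publication_conflicts(groups):
    """Return the keys of groups that list one publication more than once.

    A customer should appear only once per publication, so such a group
    is a candidate for being split again.
    """
    conflicts = []
    for key, members in groups.items():
        pubs = [m["publication"] for m in members]
        if len(pubs) != len(set(pubs)):
            conflicts.append(key)
    return conflicts


# Invented sample rows, for illustration only.
rows = [
    {"family_name": "Brown", "given_name": "John", "place": "York",
     "profession": "Organist", "publication": "PUB-A"},
    {"family_name": "Brown", "given_name": "John", "place": "York",
     "profession": "organist", "publication": "PUB-B"},
    {"family_name": "Smith", "given_name": "Mary", "place": "Bath",
     "profession": "", "publication": "PUB-A"},
    {"family_name": "Smith", "given_name": "Mary", "place": "Bath",
     "profession": "", "publication": "PUB-A"},
]
groups = tentative_customers(rows)
conflicts = same_publication_conflicts(groups)
```

Here the two Brown rows fall into one tentative customer, while the two Smith rows are flagged because both sit in the same publication.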

Segmentation

The information of each subscription can be split under the following headers:

  • Family name
  • Given name
  • Profession
  • Status (Mr., Miss, Esq., King, Hon[oura]ble., etc.)
  • Gender (to be inferred from status and given name; FactGrid has a name list with gender attributions)
  • Organisation (choir, shop, club etc.)
  • Organisation noted with individual customers
  • Customer's place of address
  • Number of sets ordered
  • Payment
      • amount
      • unit
  • Publication (col. E)
  • Date of publication (col. F)
  • Place of publication (col. G)
  • Publisher (col. H)
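
One possible shape for a segmented record, with the gender inference mentioned in the list, is sketched below. All attribute names are assumptions made for illustration. The small `STATUS_GENDER` table is invented (it includes forms such as "Mrs." that are not named above), and `name_genders` is a hypothetical stand-in for FactGrid's name list with gender attributions, whose actual format is not described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Subscription:
    """One segmented line of the list; attribute names are illustrative."""
    family_name: Optional[str] = None
    given_name: Optional[str] = None
    profession: Optional[str] = None
    status: Optional[str] = None                   # e.g. "Mr.", "Miss", "Esq."
    organisation: Optional[str] = None             # subscriber is an organisation
    affiliated_organisation: Optional[str] = None  # noted with an individual
    place: Optional[str] = None                    # customer's place of address
    sets_ordered: int = 1
    payment_amount: Optional[float] = None
    payment_unit: Optional[str] = None
    publication: Optional[str] = None              # col. E
    publication_date: Optional[str] = None         # col. F
    publication_place: Optional[str] = None        # col. G
    publisher: Optional[str] = None                # col. H


# Invented lookup table; a real one would need far more forms of address.
STATUS_GENDER = {
    "mr.": "male", "sir": "male", "esq.": "male",
    "miss": "female", "mrs.": "female", "lady": "female",
}


def infer_gender(sub, name_genders):
    """Infer gender from status first, then from the given name.

    Returns None when neither source gives an answer.
    """
    status = (sub.status or "").strip().lower()
    if status in STATUS_GENDER:
        return STATUS_GENDER[status]
    given = (sub.given_name or "").strip().lower()
    return name_genders.get(given)
```

Checking status before given name is a design choice, not a requirement: forms of address are usually less ambiguous than first names, but the order could be reversed or both sources compared for disagreement.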

The publications (E) are already identified and have Q-numbers leading to the background details (F-H), which I can provide.

Subscriptions that provide enough information to allow a provisional/actual identification

Most of the subscriptions come with internal identifiers that made sense back when the individual orders were registered: the subscribers wanted to be known as connoisseurs and supporters, which is why they agreed to see their names entered in a register introducing the respective publication.

It is hence possible to individualise a high percentage of the subscribers with the help of the identifiers stated.

We could generate procedural recommendations once the split of information is done:

  • 100% recommendation to create the person/organisation if it appears only once in the data set under the given name.
  • 90% recommendation to attribute several subscriptions to the same person/organisation - if name, profession and note of address are identical within a range of three decades
  • 80% recommendation... - if full names and places are the same within a generation
  • 70% recommendation... - if status, name and place are the same within a generation
  • 65% recommendation... - if full names and professions appear again in a generation
  • etc.
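
The pairwise tiers in this list (90% down to 65%) could be expressed as a small function. The sketch below rests on several assumptions: records are dicts with hypothetical keys (`family_name`, `given_name`, `profession`, `place`, `status`, `year`), "a generation" is taken as 25 years because the list does not define it, and "name" in the 70% rule is read as the family name. The 100% rule concerns names that occur only once and is therefore not a pairwise comparison; it is left out here.

```python
THREE_DECADES = 30
GENERATION = 25  # assumption: the list above does not define "a generation"


def _same(a, b, *fields):
    """True if every field is non-empty and equal (ignoring case) in both."""
    for f in fields:
        x = (a.get(f) or "").strip().lower()
        y = (b.get(f) or "").strip().lower()
        if not x or x != y:
            return False
    return True


def merge_recommendation(a, b):
    """Return a recommendation (in percent) to attribute two subscriptions
    to the same person/organisation, or None if no rule applies."""
    gap = abs(a["year"] - b["year"])
    full_name = ("family_name", "given_name")
    if gap <= THREE_DECADES and _same(a, b, *full_name, "profession", "place"):
        return 90
    if gap <= GENERATION:
        if _same(a, b, *full_name, "place"):
            return 80
        if _same(a, b, "status", "family_name", "place"):
            return 70
        if _same(a, b, *full_name, "profession"):
            return 65
    return None
```

Empty fields never count as a match in this sketch, so two entries that both lack a profession cannot reach the 90% tier on that basis.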

Lower recommendations could come with a call to "take a look" through potential matches. The previous segmentation will make it easy to create any subset for researchers to run through under questions such as: show me all customers in Glasgow through the century (so that I can spot possible repeated subscribers in this particular set).
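
A subset of this kind is a simple filter and sort once the data is segmented. The following is a minimal sketch, assuming records are dicts with the hypothetical keys `place`, `family_name`, `given_name` and `year`; sorting by name and then by year puts possible repeat subscribers next to each other.

```python
def subset_by_place(rows, place):
    """Return all rows for one place, sorted by name and then by year."""
    wanted = place.strip().lower()
    hits = [r for r in rows if (r.get("place") or "").strip().lower() == wanted]
    return sorted(
        hits,
        key=lambda r: (
            (r.get("family_name") or "").lower(),
            (r.get("given_name") or "").lower(),
            r.get("year") or 0,
        ),
    )
```

Calling `subset_by_place(rows, "Glasgow")` would then give the list a researcher could scan by eye.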