FactGrid talk:Subscription lists: Difference between revisions
Olaf Simons (talk | contribs)
Latest revision as of 17:13, 31 July 2024
British Music Subscriptions
The present list has 156,536 lines, each a subscription - usually for one copy, sometimes for several. We could create an item per subscription and then create some 100,000 items for the individual subscribers who can be assumed to be the agents behind the subscriptions. The tentative isolation of these agents would be necessary to make more definite statements about the customers. How many one-time customers do we have in this set? How many addicted music fans are there? How many men, how many women? How does the audience change in the course of the century? What are the statistics for professions?
In the end we would have 156,536 subscriptions connected to maybe 90,000 customers - a mass of items with almost identical amounts of information on each of the corresponding items.
It is therefore preferable to take the step towards an identification of customers before the input and then to focus on the customers in all ensuing work. The subscriptions will then simply become information on the individual customers. Some customers are famous, like the members of the royal family. Other customers are hidden behind generic labels such as "A Lady".
The identification of big shots like the King in a given year or George Frederick Handel, or of smaller shots like university members at Oxford and Cambridge or booksellers in various cities, will later be work for researchers.
A first identification should stay within the set. We should try to understand who appears in these 156,536 subscriptions again and again. Here we need a tentative internal identification procedure under a set of rules. Someone with the same name, place and profession is the same person. In other cases we may assume with a high degree of certainty that this is a customer whom we already noted in a previous subscription. No customer will appear twice in the same subscription list (any such customer would be listed once with several "sets" ordered).
- Step 1 would be a segmentation of all the information
- Step 2 would be the tentative creation of the full set of individual customers, isolated internally as probable individual agents.
Segmentation
The information of each subscription can be split under the following headers:
- Family name
- Given name
- Profession
- Status (Mr., Miss, Esq., King, Hon[oura]ble., etc.)
- Gender (to be inferred from status and given name; FactGrid has a name list with gender attributions)
- Organisation (choir, shop, club etc.)
- Organisation noted with individual customers
- Customer's place of address
- Number of sets ordered
- Payment
  - amount
  - unit
- Publication (col. E)
- Date of publication (col. F)
- Place of publication (col. G)
- Publisher (col. H)
The publications (col. E) are already identified and have Q-numbers that lead to the background details, which I can provide (cols. F-H).
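The split described above could be sketched roughly as follows. This is only an illustration: the record type `Subscription`, the function `split_entry`, the short `STATUS_MARKERS` list and the assumed input layout "Status Given Family, Profession, Place" are all hypothetical and not taken from the actual spreadsheet.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical, incomplete list of status markers; the real data needs more.
STATUS_MARKERS = ("Mr.", "Mrs.", "Miss", "Esq.", "Rev.", "Hon.")


@dataclass
class Subscription:
    """One line of the list, split under the headers named above."""
    family_name: Optional[str] = None
    given_name: Optional[str] = None
    profession: Optional[str] = None
    status: Optional[str] = None
    gender: Optional[str] = None
    organisation: Optional[str] = None
    place: Optional[str] = None
    sets_ordered: int = 1
    payment_amount: Optional[float] = None
    payment_unit: Optional[str] = None
    publication: Optional[str] = None        # Q-number, col. E
    publication_year: Optional[int] = None   # col. F


def split_entry(raw: str) -> Subscription:
    """Split an entry in the ASSUMED layout 'Status Given Family, Profession, Place'."""
    parts = [p.strip() for p in raw.split(",")]
    tokens = parts[0].split()
    status = None
    # A leading status marker is taken off before the name is split.
    if tokens and tokens[0] in STATUS_MARKERS:
        status = tokens.pop(0)
    # The last remaining token is treated as the family name.
    family = tokens.pop() if tokens else None
    given = " ".join(tokens) or None
    return Subscription(
        family_name=family,
        given_name=given,
        status=status,
        profession=parts[1] if len(parts) > 1 and parts[1] else None,
        place=parts[2] if len(parts) > 2 and parts[2] else None,
    )
```

Generic entries such as "A Lady", organisations, and status markers that follow the name would need their own handling before anything like this could run on the full list.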
Subscriptions that provide enough information to allow a provisional/actual identification
Most of the subscriptions come with internal identifiers that made sense when the individual orders were registered: the subscribers wanted to be known as connoisseurs and supporters, which is why they agreed to see their names entered in a register introducing the respective publication.
It is hence possible to individualise a high percentage of the subscribers with the help of the identifiers stated.
We could generate procedural recommendations once the information has been split:
- 100% recommendation to create the person/organisation if it appears only once in the data set under the given name.
- 90% recommendation to attribute several subscriptions to the same person/organisation - if name, profession and note of address are identical within a range of three decades
- 80% recommendation... - if full names and places are the same within a generation
- 70% recommendation... - if status, name and place are the same within a generation
- 65% recommendation... - if full names and professions appear again in a generation
- etc.
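A minimal sketch of how the pairwise rules in this list might be applied to two segmented subscriptions. Several points are assumptions rather than part of the proposal: records are plain dictionaries with hypothetical keys, "a generation" is set to 25 years, "name" in the 70% rule is read as the family name, and two lines from the same publication are never merged because a customer is listed only once per subscription list. The 100% rule needs a count over the whole data set and is left out here.

```python
GENERATION_YEARS = 25  # assumption: the text does not define "a generation"
THREE_DECADES = 30

FULL_NAME = ("family_name", "given_name")


def same(a: dict, b: dict, *keys: str) -> bool:
    """True if every key is filled in and equal in both records."""
    return all(a.get(k) and a.get(k) == b.get(k) for k in keys)


def recommendation(a: dict, b: dict) -> int:
    """Return the recommendation (in percent) to treat a and b as one customer."""
    # Same publication: a customer appears only once per subscription list.
    if a["publication"] == b["publication"]:
        return 0
    gap = abs(a["year"] - b["year"])
    if gap <= THREE_DECADES and same(a, b, *FULL_NAME, "profession", "place"):
        return 90
    if gap <= GENERATION_YEARS:
        if same(a, b, *FULL_NAME, "place"):
            return 80
        if same(a, b, "status", "family_name", "place"):
            return 70
        if same(a, b, *FULL_NAME, "profession"):
            return 65
    # No rule applies: leave the pair for manual inspection.
    return 0
```

Empty fields never count as a match, so two entries that both lack a profession are not pushed up to the 90% rule by that shared gap.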
Lower recommendations could come with a call to "take a look" through potential matches. The previous segmentation will make it easy to create any subset for researchers to run through with questions such as: show me all customers in Glasgow through the century (so that I can spot possible repeated subscribers in this particular set).
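A subset of the kind just described could be produced along these lines. The function name `repeated_customers_in` and the dictionary keys are hypothetical, and the grouping by full name is only one possible way to present candidates for a manual look.

```python
from collections import defaultdict


def repeated_customers_in(records: list, place: str) -> dict:
    """Group the subscriptions from one place by (family name, given name)."""
    groups = defaultdict(list)
    for record in records:
        if record.get("place") == place:
            key = (record.get("family_name"), record.get("given_name"))
            groups[key].append(record)
    # Only names with more than one subscription are candidates for review.
    return {name: subs for name, subs in groups.items() if len(subs) > 1}
```

A researcher would then read through each group and decide which of the lines really belong to the same subscriber.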