FactGrid talk:Subscription lists: Difference between revisions

Revision as of 18:12, 30 July 2024

British Music Subscriptions

The present list has 156,536 lines, each a subscription - usually one copy, sometimes more copies. We could create an item per subscription and then create an estimated 100,000 items for the individual subscribers, with a full repetition of the connections, which we would then have to keep in sync in all future changes.

The alternative would be a process of entity recognition that focuses on the individual subscribers. We should separate the agents into two groups:

  1. Subscribers that provide enough information to allow a provisional/actual identification
  2. Subscribers that are essentially generic, e.g. Y96894: A Lady.

Most subscribers will be type 1.

Segmentation

The information of each subscription can be split under the following headers:

  • Family name
  • Given name
  • Profession
  • Status (Mr., Miss, Esq., King, Hon[oura]ble., etc.)
  • Gender (to be inferred from status and given name; FactGrid has a name list with gender attributions)
  • Organisational context (shop, association, club, choir)
  • Place of address
  • Number of sets ordered
  • Payment
      • amount
      • unit
  • Publication (col. E)
  • Date of publication (col. F)
  • Place of publication (col. G)
  • Publisher (col. H)

Information is rarely given under all headings, and individual statements (and abbreviations) can vary.
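A minimal sketch of how one segmented line could be held in memory, assuming Python for the preprocessing. The class name, the field names and the status-to-gender table are illustrative assumptions, not an agreed schema; every field except the line ID is optional because information is rarely given under all headings.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical status-to-gender hints. A real table would be built from the
# data itself and from FactGrid's name list with gender attributions.
STATUS_GENDER = {
    "Mr.": "male", "Esq.": "male", "King": "male",
    "Miss": "female", "Mrs.": "female", "Lady": "female",
}

@dataclass
class Subscription:
    line_id: str                              # e.g. "Y3536"
    family_name: Optional[str] = None
    given_name: Optional[str] = None
    profession: Optional[str] = None
    status: Optional[str] = None
    organisation: Optional[str] = None        # shop, association, club, choir
    place: Optional[str] = None               # place of address
    sets_ordered: int = 1                     # usually one copy
    payment_amount: Optional[float] = None
    payment_unit: Optional[str] = None
    publication: Optional[str] = None         # col. E
    publication_year: Optional[int] = None    # col. F
    publication_place: Optional[str] = None   # col. G
    publisher: Optional[str] = None           # col. H

    def inferred_gender(self) -> Optional[str]:
        """Gender hint from the status only; None if nothing can be inferred."""
        return STATUS_GENDER.get(self.status) if self.status else None

lady_brown = Subscription("Y3536", family_name="Brown", status="Lady")
print(lady_brown.inferred_gender())
```

Inference from given names is left out here; it would need the name list mentioned above.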

Subscriptions that provide enough information to allow a provisional/actual identification

Most of the subscriptions come with identifiers that did the job back in the 18th century: a family name, gender (Miss, Mr., Seignor...), a given name, a place name, a profession. Some of the subscribers are organisations or companies (choirs, booksellers). Most of them are individuals in the full range from "The King" and "George Frederic Handel, Esq;" to "Lady Brown" (Y3536) who might (or might never) become identifiable in the hands of specialists.

It would be good to aggregate information before we feed the data into FactGrid. Creating individual items in later work will be painful; merging thousands of items will be equally painful. The ideal data set is already an interpretation of entities that should be kept apart, but these should come with transparent indications of potential "Merger candidates" or with warnings that a separation of data might be necessary.

FactGrid knows 18th-century composers (noted in Wikidata) but the British prosopographic dataset is more or less empty, so that it remains easy to match proposals against the present database.

The interesting matching process will focus on identities within the set, because we have up to 50,000 music lovers in this set who ordered several titles.

The optimal solution is a plausible proposal of the identities to create under rules such as:

  • 90% recommendation - if names, professions and places are the same within three decades
  • 80% recommendation - if full names and places are the same within three decades
  • 75% recommendation - if last names and places are the same within three decades
  • 70% recommendation - if status, name and place match
  • 65% recommendation - if full names and professions are the same within three decades (though places differ)
  • 30%-65% recommendation - "manual" look requested
  • < 30% - simple item creation is proposed, though with identification of possible matches
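The named rules above could be sketched as a pairwise scoring function. This is only one possible reading: it assumes that "names" and "full names" mean family plus given name, that "name" in the 70% rule means the family name, that "three decades" is a gap of at most 30 years between publication dates, and that a missing value never counts as a match. The dictionary keys and the sample records are invented for illustration. The list does not say how scores between the named rules would be computed, so the sketch returns 0 when no rule fires.

```python
from typing import Optional

def _eq(a: Optional[str], b: Optional[str]) -> bool:
    """True only if both values are present and equal, ignoring case and outer spaces."""
    return bool(a and b) and a.strip().lower() == b.strip().lower()

def recommendation(a: dict, b: dict, window: int = 30) -> int:
    """Return the merge recommendation (in percent) for two subscriber records."""
    family = _eq(a.get("family_name"), b.get("family_name"))
    full = family and _eq(a.get("given_name"), b.get("given_name"))
    place = _eq(a.get("place"), b.get("place"))
    profession = _eq(a.get("profession"), b.get("profession"))
    status = _eq(a.get("status"), b.get("status"))
    ya, yb = a.get("year"), b.get("year")
    close = ya is not None and yb is not None and abs(ya - yb) <= window

    if close and full and profession and place:
        return 90
    if close and full and place:
        return 80
    if close and family and place:
        return 75
    if status and family and place:   # no time window is stated for this rule
        return 70
    if close and full and profession:  # places differ or are missing
        return 65
    return 0  # no named rule applies

# Invented sample records, not taken from the subscription list.
first = {"family_name": "Smith", "given_name": "John", "profession": "Organist",
         "place": "York", "status": "Mr.", "year": 1750}
second = dict(first, year=1770)
print(recommendation(first, second))
```

Checking the rules from the strictest to the loosest means that each pair receives the highest recommendation it qualifies for.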

"Manual" scans of proposals would be assisted by subsets that present the data sorted by name, place, status or profession (with the respective dates of publication). One could in this case go through all candidates from York or Bath and decide with human knowledge.
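Such review subsets could be produced with a small grouping helper. The sketch below assumes that records are plain dictionaries with keys such as "place", "family_name" and "year" (names chosen for illustration); records without a value for the grouping key end up under "(unknown)". The sample records are invented.

```python
from collections import defaultdict

def review_subsets(records, key="place"):
    """Group records by one field and sort each group by family name and year."""
    groups = defaultdict(list)
    for record in records:
        groups[record.get(key) or "(unknown)"].append(record)
    for rows in groups.values():
        rows.sort(key=lambda r: (r.get("family_name") or "", r.get("year") or 0))
    return dict(groups)

# Invented sample records, not taken from the subscription list.
sample = [
    {"family_name": "Smith", "place": "York", "year": 1770},
    {"family_name": "Brown", "place": "Bath", "year": 1760},
    {"family_name": "Smith", "place": "York", "year": 1750},
    {"family_name": "Jones", "place": None, "year": 1755},
]
for place, rows in review_subsets(sample).items():
    print(place, [(r["family_name"], r["year"]) for r in rows])
```

Passing key="status" or key="profession" gives the other subsets mentioned above.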