User talk:Daniel Baránek: Difference between revisions


Revision as of 21:13, 13 May 2024

Slovakia is a blank area on our map....

Dear Daniel, I have just realised that we have the Czech places on FactGrid, but not the localities of Slovakia. I have not checked whether the Wikidata set would be a good source - but I fear you will soon need these. --Olaf Simons (talk) 11:56, 2 May 2024 (CEST)

Dear Olaf, I will soon work on the localities. Will think about it. --Daniel Baránek (talk) 11:58, 2 May 2024 (CEST)
As inhabited places they should all have a P2-Q8 statement, simple but useful in order to get quick datasets of a country. If you want to specify the status you can do this with P560-Q#. --Olaf Simons (talk) 12:36, 2 May 2024 (CEST)

P131 Statements

...you should also have a P131 item to represent your research project on FactGrid. This is a good example: Item:Q11305. You can eventually use that to select all items you have been working on. --Olaf Simons (talk) 15:16, 2 May 2024 (CEST)

Thanks for the hint. --Daniel Baránek (talk) 15:24, 2 May 2024 (CEST)
I also created a project space which you can access through the project space menus. The other projects will give you ideas - but it seems you have not arrived without any ideas of your own. User:David Löblich has created a project on the Jewish History of Saxony-Anhalt and User:Michael Wermke, professor in Jena, is thinking of a project on Jewish schools in Germany, in which he wants to explore the networks of their teachers. --Olaf Simons (talk) 22:48, 2 May 2024 (CEST)

Great, thanks, I will write more about my projects a little bit later. --Daniel Baránek (talk) 22:56, 2 May 2024 (CEST)

Thanks on my side, I did not realise that there were so many places without geocoordinates! I should have asked SPARQL.. --Olaf Simons (talk) 23:01, 2 May 2024 (CEST)
Just to let you know: I have upgraded you to Admin, so that you can run regular QuickStatements inputs etc. with greater ease. (Most of us are Admins here.) --Olaf Simons (talk) 08:01, 3 May 2024 (CEST)
Small thing: In German, English, French and Spanish we start all labels with capital letters (with the exception of name components like "von Goethe" or company names that are explicitly written in lowercase letters). The advantage lies in the alphabetical order of regular searches. --Olaf Simons (talk) 08:18, 3 May 2024 (CEST)

Houses

Dear Daniel, if you want to locate people in houses, please take a look at User:Laurenz Stapf's work on Leipzig to synchronise the data models. He in turn took a look at User:Bruno Belhoste's work on Paris. It is cool to create these data! --Olaf Simons (talk) 07:00, 7 May 2024 (CEST)

Thanks. I have created the FactGrid:Buildings data model and invited both of them to comment on it. --Daniel Baránek (talk) 12:46, 7 May 2024 (CEST)
Dear Daniel, regarding Item:Q923611: the P2 value "Residential building" is used only on your houses. All the others state Property:P2-Item:Q16200 to mark the entire ground that belongs to an address. --Olaf Simons (talk) 23:42, 9 May 2024 (CEST)
Well, this could be easily solved by Q701396:P3:Q16200, because every residential building is a real estate. --Daniel Baránek (talk) 23:49, 9 May 2024 (CEST)
Theoretically you are right. Do a wdt:P2/wdt:P3* search (instance of, followed by any number of subclass-of steps) and you have it. The problem is that the evolving ontologies are precarious on both sides of the coin. First, people do not get the complete ontologies right. They create more and more precise items that are subclasses of something or other and assume they will get a perfect ontology up to the top, but fail. They end up with things that lead up to different top terms, because each of them thinks that this or that should be on top of the ontology. In the end it becomes increasingly difficult to ask the right question in the mix of approaches within the pseudo-categorical systems they created. (Graph databases are not very transparent here - I say this as I am a huge fan of graph databases.) The other problem is that the Query Service breaks sooner or later. It does not crawl through the ontology on bigger datasets. 30,000 items with just one unified P2-Q-value are easy to grab - you still get the whole city of Paris on the map. Do the same with a complex hierarchy of things and you get a timeout. Wikidata does not provide good models of entire cities as complex and large as User:Bruno Belhoste's Paris on FactGrid. We went to the technical limits and learned. (Besides: It does not make sense to have your data on extra models.) --Olaf Simons (talk) 00:14, 10 May 2024 (CEST)
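
The difference between the flat lookup and the walk through the hierarchy can be sketched in a few lines of Python. This is only an illustration under assumptions: the dictionaries stand in for the triple store, the house names are invented, Q701396 is assumed to be the residential-building class proposed above, and the two SPARQL strings assume that the `wd:` and `wdt:` prefixes are predefined on the query service.

```python
# Toy stand-in for the triple store. Only P2, P3, Q16200 and Q701396 come from
# the discussion; the house names are invented for illustration.
INSTANCE_OF = {            # P2 statements: item -> class
    "house_a": "Q16200",   # flat modelling: tagged directly with Q16200
    "house_b": "Q701396",  # tagged with the more specific class instead
    "house_c": "Q701396",
}
SUBCLASS_OF = {            # P3 statements: class -> parent class
    "Q701396": "Q16200",   # the subclass link proposed in the thread
}

# The two query shapes as SPARQL text (prefixes assumed to be predefined).
FLAT_QUERY = "SELECT ?item WHERE { ?item wdt:P2 wd:Q16200 }"
PATH_QUERY = "SELECT ?item WHERE { ?item wdt:P2/wdt:P3* wd:Q16200 }"


def flat_lookup(target):
    """Items whose P2 value is exactly `target`: one comparison per item."""
    return sorted(item for item, cls in INSTANCE_OF.items() if cls == target)


def path_lookup(target):
    """Items whose P2 value is `target` or any class below it via P3."""
    found = []
    for item, cls in INSTANCE_OF.items():
        seen = set()  # guards against cycles in a badly built ontology
        while cls is not None and cls not in seen:
            if cls == target:
                found.append(item)
                break
            seen.add(cls)
            cls = SUBCLASS_OF.get(cls)
    return sorted(found)


print(flat_lookup("Q16200"))  # ['house_a']
print(path_lookup("Q16200"))  # ['house_a', 'house_b', 'house_c']
```

The flat lookup misses every item that carries only the more specific class, while the path lookup finds them at the cost of an extra walk per item; on a real endpoint that walk is what may run into the timeout described above.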

Thanks for clarifying. So, should I use P560, or is there any better property? --Daniel Baránek (talk) 07:57, 10 May 2024 (CEST)

P560 is a good property to state the status of an object. A project in Karlsruhe is about to use it for the various German "Denkmalschutz" categories of buildings. If you have registers explaining what these things are, that's a good property to replicate and source the status assignments. --Olaf Simons (talk) 08:43, 10 May 2024 (CEST)

Territorial entities

are usually countries, regions, provinces... --Olaf Simons (talk) 00:01, 10 May 2024 (CEST)

Human settlements

(comprising several houses/ addresses) should all be P2-Q8.

Use Property:P560 to assign a status like "Swiss Canton" or "(German) Kreisstadt" --Olaf Simons (talk) 00:01, 10 May 2024 (CEST)

Thanks, now I know. I will remodel it. --Daniel Baránek (talk) 00:06, 10 May 2024 (CEST)
There are almost 40,000 entries (not created by me) which have P2:location and P2:municipality (which should be P560:municipality). Should we remodel them as well? --Daniel Baránek (talk) 12:52, 10 May 2024 (CEST)

House number

You have linked Property:P152 to Wikidata's https://www.wikidata.org/wiki/Property:P4856 (not to https://www.wikidata.org/wiki/Property:P670). I wonder how to re-name that property properly (now that I have moved the former statements away) --Olaf Simons (talk) 09:24, 10 May 2024 (CEST)

Data model on Item:Q16200

I have now tried to document the best practice notes on Item:Q16200 - work in progress. Have a nice weekend. --Olaf Simons (talk) 09:54, 10 May 2024 (CEST)

Great. I see that you use P560 for cultural heritage status, which is OK. However, I was rather asking which property should be used for a statement that the real estate is a residential house/church/synagogue etc. Just another statement in P2? --Daniel Baránek (talk) 11:01, 10 May 2024 (CEST)

I would recommend P280 for that purpose. Greetings, David Löblich (talk) 11:20, 10 May 2024 (CEST)

Thanks. --Daniel Baránek (talk) 12:11, 10 May 2024 (CEST)

The entire data model

Dear Daniel, I am observing your input and I am wondering whether this is economical and whether it is the best path for the entire database project. If I understand you correctly you are creating items for all sections of records that refer to an apartment. The next step is probably to create items for all the apartments you are finding in these records, and then the houses in which the apartments are located and then all people (the latter two we would really like to have on board). We have never talked about the dimensions and the advantages of any such model.

All other groups discussed things beforehand with an interest in good data economy. With all the other teams we looked at data together and spoke about the options, looking for solutions that can run at large scale. The good solution is one that will work just as well if we have 400 projects doing things the same way. Laurenz Stapf's Leipzig data is in my eyes very economical - and I feel that your model will not do much more, but it will produce three times the number of items without a visible advantage in searches. So maybe we could have an online talk tomorrow (night) or Monday or so, so that I can see the dimensions of the input you are planning and the data model you intend to use. Best wishes, --Olaf Simons (talk) 23:53, 11 May 2024 (CEST)

Dear Olaf, I apologize, I am busy with other commitments these days. I will have time to respond more thoroughly perhaps on Wednesday. Best, --Daniel Baránek (talk) 11:08, 13 May 2024 (CEST)
Dear Daniel, Wednesday is my most flexible day this week, I'll send you the video link via mail and you can give me your perfect time in return --Olaf Simons (talk) 13:33, 13 May 2024 (CEST)

Intended data structure

Dear Olaf, I had a little time to visualize the intended data structure. Please take a look at it, we can discuss it later in the call if it is still needed.

I completely understand your concerns about the data economy. However, I suppose there are other values involved in creating a "database for historians," most notably a strong emphasis on source criticism. A fundamental question is how to store – or model – data essential for research. For many of my research questions, the personal census record (item) is actually more fundamental than the personal item. Some examples:

  1. I need to compare the number of people in a particular demographic group, as reported by official statistics, with what can be found in extant census records (which are often different for various reasons). If I have census record items, I can create a simple query. However, if I don't have such items, I would have to create such a query in a complicated way via personal items and a complex qualifier structure that risks inconsistency. Actually, this goes back to the issue of "flat modelling" and of simple grabbing of items. If we were to create complex queries with many qualifiers, we would soon reach a timeout. Creating items for individual records allows you to flatten and simplify queries.
  2. I need to record various strikethroughs in the text. For example, only one language of daily use could be legally recorded in the census. However, many people reported multiple languages. Census commissioners then made people choose only one and crossed out the others. From the point of view of research on multilingualism, however, even these crossed-out records are very valuable. If there is a census record item, it will be easy to model this fact using the qualifier: P820 (object has role, or some better qualifier): strikethrough (or something like that). However, this is almost impossible to model in personal items because I can't create a qualifier of a qualifier. Moreover, the language has changed for persons over time. Thus, for individual personal items, I could have a complex structure like the following:
    • language skills (P460): German, date: 1890, reference: census 1890 + school records
    • language skills: Czech, date: 1890, object has role: strikethrough, reference: census 1890
    • language skills: Czech, date: 1890, reference: school records
    • language skills: Czech, date: 1900, reference: census 1900 + school records
    • language skills: German, date: 1900, object has role: strikethrough, reference: census 1900
    • language skills: Polish, date: 1900, object has role: strikethrough, reference: census 1900
    • language skills: German, date: 1900, reference: school records
    • etc.
    I think it is obvious that maintaining and querying such a structure is virtually impossible. Especially if we use QuickStatements for batching because it has the annoying feature that it adds qualifiers regardless of other qualifiers. So if we have a QS command P460|German|P820|strikethrough, it will add P820 to the first P460|German regardless of the date and other qualifiers.
  3. I need to know who was in the apartment with whom. Yes, I can create a complex structure containing a large number of P47 (municipality, street, house number, apartment number). But this brings us back to creating complex queries, timeouts and the uselessness of QS for batching.
  4. Census records also contain information that is not directly relevant to the persons in question. For example, a person's relationship to the head of household is relevant to historical demographics, but not necessarily to the person themselves. The head-of-household - partner - children role is important, but not the nanny, employee, and certainly not the lodger. I don't see any reason why we should model the relationship of the nanny to the head of the household in personal items. Similarly, there is no point in storing various original notes in personal items that relate only to that record.
  5. Problem with misidentification of persons. If I have census record items, I can easily change the person to whom the record relates. However, if all the information was stored in personal items, it would be hard to separate the information between the correct and misidentified person.
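
The qualifier problem from point 2 can be made concrete with a small simulation. This is a sketch, not QuickStatements itself: `add_qualifier_first_match` is a hypothetical helper that only mimics the behaviour described above (matching on property and value, ignoring existing qualifiers), and `date` is used as a readable placeholder for the date qualifier.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    prop: str
    value: str
    qualifiers: dict = field(default_factory=dict)


def add_qualifier_first_match(statements, prop, value, q_prop, q_value):
    """Attach a qualifier to the first statement matching property and value.

    Existing qualifiers (such as the date) are deliberately ignored, which
    mimics the batching behaviour described in the discussion.
    """
    for statement in statements:
        if statement.prop == prop and statement.value == value:
            statement.qualifiers[q_prop] = q_value
            return statement
    return None


# Personal item: German was recorded in 1890 and crossed out in 1900.
person = [
    Statement("P460", "German", {"date": "1890"}),
    Statement("P460", "German", {"date": "1900"}),
]
hit = add_qualifier_first_match(person, "P460", "German", "P820", "strikethrough")
print(hit.qualifiers)  # the 1890 statement was hit, not the intended 1900 one

# One item per census record: each record holds a single P460|German
# statement, so the same command cannot land on the wrong year.
records = {
    "record_1890": [Statement("P460", "German")],
    "record_1900": [Statement("P460", "German")],
}
add_qualifier_first_match(
    records["record_1900"], "P460", "German", "P820", "strikethrough"
)
print(records["record_1900"][0].qualifiers)
```

With record items the target of the command is fixed by the item it is sent to, so no qualifier matching is needed at all.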

In summary, I understand that the intended data structure may be perceived as largely redundant. However, I see no better solution than to create items within the "Archival structure" in the diagram. If we were to create entries only within the "Localities" and "Person" sections, it would make data management extremely complex, if not impossible. Best, --Daniel Baránek (talk) 17:32, 13 May 2024 (CEST)
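
A rough sketch of what record items make possible, covering the counting from point 1 and the re-identification from point 5. All names here (`CensusRecord`, `count_languages`, `reassign`, the record and person IDs) are hypothetical, and the data only loosely follows the language example above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class CensusRecord:
    record_id: str
    census_year: int
    person: str      # link to the personal item; can be corrected later
    languages: list  # (language, struck_out) pairs as written in the source


RECORDS = [
    CensusRecord("rec_1890_a", 1890, "person_A",
                 [("German", False), ("Czech", True)]),
    CensusRecord("rec_1900_a", 1900, "person_A",
                 [("Czech", False), ("German", True), ("Polish", True)]),
    CensusRecord("rec_1900_b", 1900, "person_B",
                 [("German", False)]),
]


def count_languages(records, year, include_struck_out=False):
    """Count language entries for one census year in a single flat pass."""
    counts = Counter()
    for record in records:
        if record.census_year != year:
            continue
        for language, struck_out in record.languages:
            if include_struck_out or not struck_out:
                counts[language] += 1
    return counts


def reassign(record, new_person):
    """Fix a misidentification by changing one link; the record data stays put."""
    record.person = new_person


print(count_languages(RECORDS, 1900))
print(count_languages(RECORDS, 1900, include_struck_out=True))
```

The official count and the count including crossed-out entries differ only in one flag, and correcting a misidentified person touches a single field rather than a set of statements spread over two personal items.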


The main thing that startled me is that we might not only get the apartments with all their tenants (both are extremely welcome) - but also additional items for each mention of an individual apartment in a dozen census records. I have no clue about the numbers but let me just project fictional numbers. If you have 50,000 apartments, and if each apartment appears in only 10 census lists, that will be half a million items that will mostly just duplicate the information which you will also put on all the people mentioned in these documents.
In this case I would assume that you could get a leaner input with just 10 items for the ten census years (comprising the entire documentation) and use a string-based reference system to refer to entries. In this leaner case you create 50,000 apartments which you localise on 10,000 houses ...and do it like Laurenz Stapf did it for Leipzig: state, on the persons who will stand in the centre, the apartment and the reference to the census record: Census 1900 (one Item), record 23.678 (one string reference). [I say this without knowledge of the data.]
The qualifier problem which you have mentioned is nasty (we all struggle with it) but then again you will not know exact begin and end points of each localisation. You seem to have only simple dates (P106) on census years.
I would prefer to think about this with a look at your documents and data and we should have Laurenz Stapf on board who is presently feeding taxation data on all the inhabitants of Leipzig into the machine - according to Leipzig's house and tax registers. Here I know the primary documents, because Laurenz uploaded them to Wikimedia Commons and integrated them as source information on each person and house. --Olaf Simons (talk) 18:06, 13 May 2024 (CEST)
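
The projection above, written out as arithmetic. The figures are the fictional ones from the comment (50,000 apartments, 10 census lists, 10,000 houses); the variable names are only for illustration, and person items are left out because both models would contain them.

```python
# Fictional figures from the projection above, not real data.
apartments = 50_000
census_lists_per_apartment = 10
houses = 10_000
census_years = 10

# Model with one extra item per mention of an apartment in a census list.
mention_items = apartments * census_lists_per_apartment

# Leaner model: one item per census year, plus apartments and houses;
# individual entries are cited through string references instead of items.
lean_items = census_years + apartments + houses

print(mention_items)  # 500000
print(lean_items)     # 60010
```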
Data Model

The problem with apartments is that they are rather fictional in my case. The apartments were not registered anywhere, they did not have a number on their doors, the number was assigned to them rather randomly in a given census. At the next census, the same apartment could be assigned a different number, apartments could be rebuilt and connected. So it makes no sense to me to create items for apartments and to imitate Laurenz Stapf's system (like Q545027).

To the numbers. I currently have data for:

  • Archival structure section: 247 × Q908266 (census records for municipality), 137 × Q919281 (CR for neighbourhood), 6445 × Q908267 (CR for houses), 10627 × Q908504 (CR for flats), 43075 × CR for individual people (without Q yet).
  • Localities and persons section: 4098 × Q16200 (houses), 33061 × Q7 (people).
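
For orientation, the counts listed above can be totalled per section. The figures are exactly those given in the two bullet points; the dictionary keys are only descriptive labels.

```python
# Item counts as listed above, grouped by section of the diagram.
archival_structure = {
    "census records for municipality (Q908266)": 247,
    "census records for neighbourhood (Q919281)": 137,
    "census records for houses (Q908267)": 6445,
    "census records for flats (Q908504)": 10627,
    "census records for individual people": 43075,
}
localities_and_persons = {
    "houses (Q16200)": 4098,
    "people (Q7)": 33061,
}

archival_total = sum(archival_structure.values())
localities_persons_total = sum(localities_and_persons.values())

print(archival_total)                              # 60531
print(localities_persons_total)                    # 37159
print(archival_total + localities_persons_total)   # 97690
```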

Unfortunately, Laurenz Stapf's model does not suit the needs of my research. If I take the random example of Q539688, there is information about the amount of rent and tax using a qualifier. Fine, but how would it be possible to provide information about a crossed-out record? As I said, such records are quite common in my research and have high informational value. How can I model this if I can't add a qualifier to the qualifier? --Daniel Baránek (talk) 18:42, 13 May 2024 (CEST)


Laurenz's "apartments" are just as fictitious. The documents are locating tenants (families, juridical persons, people who share an apartment...) but they do not go beyond floors and indications like "main building" and "back house". Georg Fertig did a good deal of entity recognition to decide where we actually have the same people and continuities among apartments (with indications of dates when people left and when new people moved in).

Having said this, I feel that it is actually quite a reason against creating items for fictitious apartments that sound the same but might not at all be the same apartments in reality.

Without knowing your documentary evidence it seems more of a reason to reactivate Property:P603 and to name the people who shared apartments according to various sources. --Olaf Simons (talk) 21:07, 13 May 2024 (CEST)

However, this still does not solve the problem of crossed-out data.
As I think about it, this is actually about two types of historical scholarship production: source editions, and analysis-based knowledge. If we translate this to FactGrid reality, we can create items that match the sources (a sort of source edition), and/or we can create items which analyse, reconstruct and interpret the historical reality ("Localities" and "Person" sections). Because personal or locality items are actually putting the image together from different pieces of primary sources. I would be quite sorry if FactGrid only gave space to the latter approach and didn't leave room for source-based items.
This is quite a fundamental question for me. My next project (if I get funding) will focus on the digitization of Jewish records. Although it doesn't seem like it at first glance, there is even less connection to personal items than in the case of the census. For example, in the birth records, the data relate not only to the child born, but also to the parents and grandparents, midwife, mohel and sandek. These other persons are often not immediately identifiable. More importantly, however, there is often additional information (residence, occupation) for all of these people in the records that would require a qualifier of qualifier if we were only creating personal items. For the complexity of the record, see e.g. this entry (https://matrika.wikibase.cloud/wiki/Item:Q34) created in the pilot project.
As you can see, I'm not dependent on FactGrid in any way. I don't need it for my data. I can use Wikibase Cloud or my own Wikibase instance. But I thought it was a great idea to create something like Wikidata for historians. Something that will also offer space for research based on source criticism. I am able and willing to adapt in many things (flat modelling, how to use some properties etc.). However, if there is no room for items in the "Archival structure section", then unfortunately there is no room for my research either. I don't write that with any bitterness or as a threat. I respect that you have a policy for creating items and that many researchers are comfortable with it. I just want to say that not allowing source-based items wouldn't be comfortable for me and I would have to – easily – find another solution for publishing and sharing my data. --Daniel Baránek (talk) 21:57, 13 May 2024 (CEST)

You have hit the problem. We are planning to create a second Wikibase this summer. It is called the ClaimBase. Each claim gets an item. A claim can be a line in an address book of X-town: "Mrs Urbanek, widow of the late University professor, X-street No. 34." This gets an Item and then we link it to possible identifications that might have items on FactGrid or elsewhere on the ClaimBase.

Lucas Werkmeister, who helps us from the Wikidata developer team to run the installation, has already said that he fears any such data model - an item per documented claim of reality - will crash the ClaimBase very quickly under the weight of data. We have groups who want to import millions of lines from address books without yet knowing who is who in all these books.

FactGrid is, like Wikidata, an encyclopaedic construct that (unlike Wikidata) refers to loads of primary documents that are analysed here for the first time in history.

FactGrid is offering research results - an interpretation of our findings, a representation of reality.

The ClaimBase will offer (potentially complex) statements of sources that need to be interpreted - and we have no clue whether it will actually work. If it works, it will grow very fast with the masses of items any such model will rest on. A claim could be that there is an unidentified apartment in the city of so-and-so, house No. 34 in x-street with the following 6 names of people who need to be identified. --Olaf Simons (talk) 22:13, 13 May 2024 (CEST)
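
Purely as an illustration of the idea of one item per claim with links to possible identifications: the sketch below is not the ClaimBase data model, which is still being planned. The class, its field names and the candidate identifiers are all hypothetical; only the address-book line is taken from the comment above.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    claim_id: str
    source: str
    text: str  # the claim as the source states it
    # IDs of items that might be what the claim refers to (hypothetical).
    candidates: list = field(default_factory=list)


claim = Claim(
    claim_id="claim_1",
    source="address book of X-town",
    text="Mrs Urbanek, widow of the late University professor, X-street No. 34.",
)

# Identification is a separate, revisable step: candidates are added or
# removed without touching the wording of the source.
claim.candidates.append("hypothetical_person_item_1")
claim.candidates.append("hypothetical_person_item_2")
claim.candidates.remove("hypothetical_person_item_2")

print(claim.candidates)  # ['hypothetical_person_item_1']
```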