FactGrid:Data modeling

From FactGrid
Revision as of 14:22, 12 August 2020 by Olaf Simons (talk | contribs) (Location)

The problem of on-and-off existence

Gotha's Lodge Item:Q10575 is a typical example: it has a starting point Property:P49 but actually several new starts and more than one end Property:P50. People found a lodge, the lodge is closed (in the events of the French Revolution), it re-opens in the national wave of the Napoleonic wars, it is closed again in 1935 and re-opened in 1949 or later or never.

Is it always the same lodge? What do we do with changing names? The German National Library creates loads of items under ever new names - and you have to make sure to get the respective successions. The alternative: If those involved wanted to state that they actually are the same people (even if that is only a fiction of continuity) let them:

The present solution: We have started to use Property:P137 "History" and Item:Q94446 "active phase":

See Item:Q10575 Lodge "Ernst zum Compaß", Gotha

We can now use qualifiers on each active phase in order to state a respective beginning and a respective end. Is this a good option? How does it work in SPARQL searches if you want to give a time line? How does the option agree with the general P49/P50 use on items?

--Timeline is fine. Try for example Gotha's Lodge between 1760 and 1830. No problem either with the general P49/P50 (you can use OPTIONAL for the history) --Bruno Belhoste (talk) 18:46, 14 May 2020 (CEST)

Beautiful. You know how to script these things. (And you should write a blog post one of these days about practical tips, like how to get OpenRefine connected or how to best learn scripting these things...) --22:18, 14 May 2020 (CEST)
Just tested it with all the lodges that have these markers. Turns out that the search does not get the continuities. I added labels to show that.
The visualisation packs the histories without understanding the continuities. --Olaf Simons (talk) 07:24, 15 May 2020 (CEST)
Yes, there is a problem. I don't see any way to solve it: the Timeline module of the query service lays out the data so as to maximise compactness, and there is no way to constrain it to align data by any criterion. The only solution is to use another Timeline module with more features.--Bruno Belhoste (talk) 12:12, 15 May 2020 (CEST)
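The linked timeline query is not preserved on this page. As a minimal sketch of the pattern discussed here, the following lists the active phases of the Gotha lodge with their own start and end dates. It assumes that each Property:P137 statement with the value Item:Q94446 carries P49 and P50 as qualifiers, and that the FactGrid query service predefines the usual Wikibase prefixes (wd:, p:, ps:, pq:).

```sparql
SELECT ?phaseStart ?phaseEnd WHERE {
  # One statement node per "active phase" of the lodge
  wd:Q10575 p:P137 ?phase .
  ?phase ps:P137 wd:Q94446 .
  # Assumption: begin and end are stored as P49/P50 qualifiers on the phase
  OPTIONAL { ?phase pq:P49 ?phaseStart . }
  OPTIONAL { ?phase pq:P50 ?phaseEnd . }
}
ORDER BY ?phaseStart
```

Because the phases hang on statement nodes of a single item, every row belongs to the same lodge; whether a timeline view can keep them on one line is a separate matter of the visualisation.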

Changing names

An organisation can run through dozens of name changes over the years, e.g. an Early Modern publishing house with name changes whenever a father hands down the business to son, wife, son-in-law etc.

We are presently using Property:P57 for a history of naming, but that is not ideal since usual searches will not get to the right names at the right moment.

It is a major question. I am not convinced by using Property:P57, for the reason you give. In my view, as historians, we have to give information that is as close as possible to the sources and to the actors themselves. This means that we have to name organizations as they were named in the sources. It is possible that the same organization has two or three different names at the same time. We then have to choose one of its names as the reference name and put the other ones as aliases (and maybe also as objects of the property P57 naming). But, in my view, if the organization changed its name at certain times, we have to create different items. Let me take the example of the French Academy of Sciences. It was an Old Regime institution called Académie royale des sciences Item:Q153559, then during the Revolution it became the First Class of the National Institute Item:Q153578, and finally in 1816 again Académie des sciences Item:Q153579. Is it the same institution? Maybe yes, maybe no, but from the point of view of historians it is much better, I think, to consider them as different institutions, at least as a basic statement. The condition is that these different institutions are linked to make a chain, which will be the "timeless institution". I propose to use Property:P6 and Property:P7 to create this chain. Suppose you want to consider in a query not the Old Regime Academy Item:Q153559 but the Academy of Sciences in the longue durée. You can create the variable ?Academy in the WHERE clause with the triple: wd:Q153559 wdt:P7* ?Academy. If you use ?Academy in another clause you can retrieve information concerning the Academy of Sciences from its foundation in the 17th century to today; if you want to consider only the First Class of the Institute and the modern Academy of Sciences you can create the variable ?Academy in the WHERE clause with the triple: wd:Q153578 wdt:P7* ?Academy, and so on. Try [1] and [2].--Bruno Belhoste (talk) 14:22, 15 May 2020 (CEST)
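Written out as a complete query, the chain proposed above could look like the following sketch. It assumes that the three Academy items are actually linked by Property:P7 statements and that the query service predefines the wd:, wdt:, wikibase: and bd: prefixes.

```sparql
SELECT ?Academy ?AcademyLabel WHERE {
  # Zero or more P7 steps: the starting item itself plus every item
  # reachable by following the chain
  wd:Q153559 wdt:P7* ?Academy .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,fr,de". }
}
```

Replacing wd:Q153559 with wd:Q153578 starts the chain at the First Class of the Institute instead.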



[Figure: Ernst Howald and Henry E. Sigerist. Antonii Musa De herba vettonica ... Leipzig 1927]
We created two properties for this: Property:P233 names the object (a book edition, a manuscript or any other thing) that is genetically earlier. Property:P234 comes as the qualifier and states on what basis the object can be seen as a successor. You might for instance link a translation to the edition that gave the original text.

The organisation is top-down chronological (the guide lines in the picture above are not that beautiful, but dates on the y-axis would be cool).

Objects can have multiple connections to earlier Items (a medieval scribe could use two books to create a new version of the text).

It would be cool if the P234 information became available on the lines that are connecting items.

One of the problems here, too, is: how do I select a family of items?
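One possible answer, offered only as a sketch: start from any one member and follow Property:P233 links in both directions until nothing new is reached. The item wd:Q000000 below is a placeholder, not a real item, and the standard Wikibase prefixes are assumed to be predefined.

```sparql
SELECT DISTINCT ?member ?memberLabel WHERE {
  # Placeholder seed: replace with any item known to belong to the family
  VALUES ?seed { wd:Q000000 }
  # Follow P233 forwards (to earlier objects) and backwards (to later ones),
  # any number of steps, so that side branches are collected as well
  ?seed (wdt:P233|^wdt:P233)* ?member .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,de". }
}
```

On large families a path that runs in both directions can be slow; limiting it to one direction (only wdt:P233* or only ^wdt:P233*) returns ancestors or descendants alone.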

The situation at a particular point in time

Think of a house: tenants are moving in and out. We can model that with Property:P239 "resident" and P49/P50 qualifiers. You are now interested in the situation at a point in history: who were the tenants on March 3, 1848?

If we can get this done for a house like Item:Q14572 we might be able to show a city at a point in time.
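A sketch of such a point-in-time query for the house Item:Q14572, assuming the residents are recorded as P239 statements with P49 (begin) and P50 (end) as qualifiers, as described above. A missing qualifier is treated as an open interval, which is a choice of this sketch rather than established practice.

```sparql
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?resident ?residentLabel ?start ?end WHERE {
  # Full statement nodes, so that the qualifiers are reachable
  wd:Q14572 p:P239 ?statement .
  ?statement ps:P239 ?resident .
  OPTIONAL { ?statement pq:P49 ?start . }
  OPTIONAL { ?statement pq:P50 ?end . }
  # Keep residencies that had begun by the date and had not yet ended
  FILTER(!BOUND(?start) || ?start <= "1848-03-03T00:00:00Z"^^xsd:dateTime)
  FILTER(!BOUND(?end)   || ?end   >= "1848-03-03T00:00:00Z"^^xsd:dateTime)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,de". }
}
```

Dates entered with year precision only are stored as 1 January of that year, so a tenant recorded as leaving "in 1848" would be excluded here; the comparison would have to be loosened for such data.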

Organisational ties

Originally we thought we should use the Wikibase advantage of being able to create any imaginable statement to do just that. A lodge has a "Mother lodge", so do create a link to that (with the respective daughter-lodge property). It can choose an umbrella organisation, it will accept an obedience (adhering to a system) etc.

The problem of these specific properties

  • is that a neighbouring organisation might have its own nomenclature for pretty much the same dependencies - or that it can have the same terms, but mean something different with them.
  • that you need to know the specific terminology in order to run a query

The general solution could be a single standard property like "organisational ties", with qualifiers that specify the particular tie.

A more specific solution can lie in between these poles: We create a pattern of general types of these ties, so that we can then ask broad questions in order to see different networks.

  1. is owned by / owns
  2. parent organisation / subsidiaries
  3. received the patent from / granted patents to
  4. represented by / representing
  5. member of / members
  6. partner organisations
  7. organisationally supported by / supporting with organisational help
  8. financed by / finances
  9. recognised by / recognises
  10. next hierarchical level above / next hierarchical level underneath

One would now use qualifiers to give the exact terminologies. Which of these are redundant? Which of them are missing?

Is there a better option?

Data and Metadata

It is important to distinguish carefully between data and metadata. A datum is a specific entity; metadata is data that provides information about other data, and such an item is also called a class. An entity is defined as a datum by P2 (instance of), and as metadata (a class) by P3 (subclass of). P2 and P3 are exclusive: an item cannot be at the same time an "instance of" and a "subclass of" the same item.

Defining classes (and subclasses) improves queries tremendously.

For example: "University of Erfurt" (Q11263) is "an instance of" (P2) "university" (Q11307), which is "a subclass of" (P3) "higher education institution" (Q144732), which is "a subclass of" (P3) "education institution" (Q160273), which is "a subclass of" (P3) "organisation" (Q12).

If you want to get only the universities (including the University of Erfurt), you make the simple query:

?universities wdt:P2 wd:Q11307.

If you want to get all the education institutions (including the universities, and especially the University of Erfurt), you make a property-path query (using the slash /):

?educationInstitution wdt:P2/wdt:P3* wd:Q160273.

The star * means that the path follows zero or more P3 steps, so you reach wd:Q160273 itself and all of its subclasses.
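Embedded in a complete query, the pattern could be run as follows. This is a sketch that assumes the query service predefines the wd:, wdt:, wikibase: and bd: prefixes; the LIMIT is only there to keep a first test run small.

```sparql
SELECT ?educationInstitution ?educationInstitutionLabel WHERE {
  # instance of (P2) a class that is Q160273 or any of its subclasses (P3*)
  ?educationInstitution wdt:P2/wdt:P3* wd:Q160273 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,de". }
}
LIMIT 200
```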

In conclusion, use P3 "subclass of" and not P2 "instance of" when you create an item which is metadata. Quite often the superclass of this subclass does not exist yet and you have to create it at the same time. It is a bottom-up way to develop the ontology of FactGrid. --Bruno Belhoste (talk) 15:55, 20 May 2020 (CEST)

A cohesive data model for people (and careers)

We will need a more cohesive data model for people, especially to note positions, employments, offices held.

I started with my own CV as I felt I could handle this without deeper recourse to complex data models

  • Olaf Simons - a CV with modern information about employments

Our Illuminati biographies required greater attention to membership and offices or ranks these people held:

  • Christian Georg von Helmolt - a CV with extensive information about the (masonic) offices held by von Helmolt; note here the combination of P266 offices held with qualifier P267 organisational context.

The third wave of biographies, derived from the Thüringer Pfarrerbücher, needed to be synchronised with information that appeared on the different pastorates. Note in both Items the use of positions and

Barbara Kröger and Christian Popp of the Germania Sacra project proposed this arrangement for

  • Heinrich Belitz - note here the use of the career statement and the use of P91 (membership) as a qualifier.

My proposal would be to use P91 in the first level triples, as they give us valuable information about the organisational networking. If we bring Belitz's life into the Christian Georg von Helmolt pattern we will create a CV like

It is essentially irrelevant how we do it, and to some extent merely a question of transformations that will be more or less easy. We would have to transform some 2000 CVs of pastors to state that they were pastors in several places. The ugly part is that much of this would become manual work: Wikibase does not like a QuickStatements input with repetitive statements. If a person is a pastor in 4 different positions, the automatic input will state once that he was a pastor; it will then list the positions and, terribly, all the begin dates and all the end dates in two separate heaps of no further use.

It is clear on the other hand why one might not like to state that person X was employed at Abbey Y in the position of an abbot.

We can pragmatically allow all these statements side by side. The software is just collecting triples and none of these triples is wrong in itself. The problem is here basically that the same searches will not work throughout the database's range - which will become nasty if ever we manage to run a simple search interface with a few standard input fields on the database.

A simple way to set a compromise is to slightly rename certain Properties to make them work from the 12th century abbot to the 21st century employee.

A side remark on the career statement option: I have used this Property with the hope to win a specialist on professions who would use her own database to revise and differentiate this mess with her own interest in jobs. That is basically why I hesitated to turn that very open statement into a base. (The second reason I mentioned above: the automatic input would produce a lot of work once we began to use repetitive statements of the same with ever changing qualifiers.)

An exemplary CV to serve all the lives I have just mentioned would be welcome. --Olaf Simons (talk)

Bruno Belhoste

I think it is important that all FactGrid projects adopt the same basic model for the actors (individuals but also institutions) to make the queries easier. Obviously, each project can propose extensions to this basic model according to its needs.

Career modeling is a delicate point. As Olaf explains, it is exactly the same to state the offices held by a person and then, as qualifiers, the institutions concerned, or to state the institutions where the person is active and then, as qualifiers, the offices held. However, it is a major change for queries. Since Olaf opted for the first solution, it is clear that all projects should adopt it.
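To make the consequence for queries concrete, here is a sketch of how offices would be retrieved under the first solution, using P266 (offices held) as the main statement and P267 (organisational context) as its qualifier, as on the von Helmolt item. The standard Wikibase prefixes are assumed to be predefined.

```sparql
SELECT ?person ?office ?organisation WHERE {
  # The office is the value of the statement ...
  ?person p:P266 ?statement .
  ?statement ps:P266 ?office .
  # ... and the institution is only reachable through the qualifier
  OPTIONAL { ?statement pq:P267 ?organisation . }
}
LIMIT 100
```

Under the second solution the roles would be swapped: the institution would sit in the ps: position and the office in a pq: position, so a query written for one pattern returns nothing for data entered in the other.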

Olaf points out a serious problem which occurs in all cases: when you make an input with QuickStatements, you cannot repeat the same statement twice while varying the qualifiers, because these qualifiers are then attributed to both statements indiscriminately. For example, Georg Eckolt (Q41800) is at Pastorate Gräfentonna first as a deacon and then as a pastor, but the start and end dates are not distinguished.

The data can be entered by hand, but this can be very tedious. In my opinion, this is not the best solution. In fact, I think that there is a problem of data modeling. For each person, each occupation must be unique. This means, for example, that Georg Eckolt is not a pastor twice, first in Emleben and then in Gräfentonna; he is a pastor (general declaration, which is not mandatory), a pastor in Emleben (second declaration) and a pastor in Gräfentonna (third declaration). Two items must therefore be created: "pastor in Emleben" and "pastor in Gräfentonna", which are two instances of the item "pastor" (which is a class) with the locations Emleben and Gräfentonna respectively (by the way, notice that Friedrich Schösser (Q41748) is also "pastor in Gräfentonna"). This will solve all the problems with qualifiers. --Bruno Belhoste (talk) 18:15, 23 May 2020 (CEST)

Sources with scans

One unsolved question for me is how to indicate source information including scans of these sources. It is important that the specific part of the scan where a piece of information comes from can be presented to the user. One idea is to use "Personas", i.e. occurrences of a person in some kind of source. That Persona item could be linked to a (scan of a) page, with the position recorded as qualifiers of the statement. The page in turn is linked to a higher-level source (a book, a section of a book, or a volume). Here is a diagram showing the idea:

                                             +----------------------------------------------+
+------------------------+                   |instance of:  Page                            |
|instance of: Persona    |                   |scan:         https://digibib.genealogy.net/… |
|family name: Opitz      +------------------>+page name:    II-23                           |
|given name:  Johann     |  scan             +--------------+-------------------------------+
|occupation:  Freigärtner|  y position: 300                 |   is part of
+------------------------+  x position: 53                  |   sequence number: 75
                                             +--------------+-------------------------------------------+
                                             |instance of:      Book                                    |
                                             |title:            Adressbuch für den Kreis Hirschberg 1927|
                                             |scan of cover:    https://digibib.genealogy.net/…         |
                                             |link to digibib:  https://digibib.genealogy.net/…         |
                                             +----------------------------------------------------------+

I could create an example in FactGrid but I would need a couple of the Properties for that.


Location

There are two ways to locate an item: with Property:P47 and with Property:P83. The normal way is to use Property:P47; Property:P83 is used to indicate the place of residence of a person. Unfortunately there are cases where the choice between Property:P47 and Property:P83 is less obvious. For instance, how do we locate a library: by Property:P47 or by Property:P83? In some cases Property:P83 is used (see for instance Item:Q11268), and in other cases Property:P47 (see for instance Item:Q38644). To avoid such inconsistencies, I propose to limit the use of P83 to person items. For locating organizations, P47 should be used systematically. The description of the properties should make this clear for the users.--Bruno Belhoste (talk)

The Properties have their developments (which does not excuse the problems this creates). Property:P47 began as a property to state the location of books: the answer was an archive or library, with a shelf-mark as qualifier. We then needed addresses for these institutions and felt that this required a Property like Property:P83 for persons, companies, libraries and archives located in towns (so that we can run searches that immediately go to the geo-coordinates of these places: Property:P83 links to a town etc. with coordinates). To make things worse we became even more specific with addresses: Property:P434 names a property, a house or an address with coordinates.
The logic behind the differentiation is not from where we are linking but with what specificity we are linking. The advantage is that you can run unified searches: Are two people in the same town? You do not get that if you ask for addresses. You get it with P83. Do they live in the same house? Ask with P434. Do you just want to state a location? Use P47.
The idea was the search that tells me something about shared locations and about the intensity of connections. --Olaf Simons (talk) 14:23, 11 August 2020 (CEST)
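The "same town" search mentioned above can be sketched as a self-join on Property:P83. The query assumes only that both items carry a P83 statement pointing to the same place item; it does not check that the items are persons, and the standard prefixes are assumed to be predefined.

```sparql
SELECT ?a ?b ?town WHERE {
  ?a wdt:P83 ?town .
  ?b wdt:P83 ?town .
  # Report each pair once and never pair an item with itself
  FILTER(STR(?a) < STR(?b))
}
LIMIT 100
```

Swapping wdt:P83 for wdt:P434 turns the same pattern into the "same house" question.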

I had to take a closer look, and I have mixed feelings about merging. The property to go to would be Property:P47, as this is often used in qualifiers. I am just not sure whether we will not suddenly realise that we need a property for the "place of home address" Property:P83, as this is a complex thing, both for people and for companies registered at places. People have home addresses though they might be travelling around for much of the year. It is the town where you have citizen's rights even if you are stationed in ever-changing places on campaigns of your regiment or journeys of your theatre company. So here I felt we need a split between the "place of home address" and the numerous temporary stays Property:P296, which we might like to map separately (there is even a research stay Property:P351). Once we delete Property:P83 we will open something new like address of headquarters or place of citizen rights.

The original property:P47 was called "location" and we used it mostly to reference files in archives. We moved that particular use to Property:P329 referring to institutions and needed a new word for property:P47 which would be open but different from the place of home address Property:P83.

Property:P434 came into being on items such as Item:Q1654, where I realised I could be far more specific. I did not want to remove the unspecific place reference, as this is the one that allows us to ask for events that took place in a town (in this case Gotha) on a timeline without having to know the address. If we go for specific references in all cases we will always need queries with a radius around a specific point to get all the events that were available in the evening calendar of this place. That is why I did not just delete the unspecific Gotha reference in favour of the "better" and more specific house reference.

I would rather go for better wordings of these properties and for exact descriptions to work as the manual.

  • Property:P47 goes into the direction of "where did this happen?"
  • Property:P83 states the commune of local citizen rights, the municipality that receives direct taxes from a company having its headquarters here, the commune that registers a person or organisation.
  • Property:P434 refers to an address such as Item:Q15668 (it might make sense to change the P47 statement on this item to a P83 statement in order to reference the municipality that is harboring this address).
  • Property:P296 "stay in" to note temporary stays
  • Property:P351 to note research stays, as these are a particular thing in the CV of a scholar.
  • Property:P329 to state the institution (library, archive, museum) that has the item in stock (can be stated more than once in the case of an edition with numerous copies to locate in different libraries).

If we delete properties this will be a loss of differentiation even if we have not been fully consistent so far. The alternative could be running two properties on one item in particularly doubtful cases — and generally: an attempt to be more consistent where we have failed. We should only merge if this does not produce weird search results. --Olaf Simons (talk) 11:03, 12 August 2020 (CEST)

I agree with this approach. However, I see a problem with Property:P47, which is often used to locate a person, especially as a qualifier (see for instance Item:Q44 at the statement Property:P165; it is not an isolated case). In these cases, Property:P47 is clearly not used to locate an event. Would it be possible to shift these specific cases of Property:P47 to Property:P83? Another point: in French the label of Property:P83 should be "localisation" (="location"), because there is no word which can be applied both to the residence of a person and to the place of an organization. --Bruno Belhoste (talk)
query P47 "where?" used as qualifier of P165 --Bruno Belhoste (talk)
The last search is particularly nasty. It notes the place of registry for Illuminati. Sometimes these coincide with the existence of a Minerval Church, and here I should note the Minerval Church under membership. Other members built local groups without meeting, but under supervision as a local group (like Göttingen). I eventually decided to wait for the research grant that will allow me to bring all the Illuminati files online, freshly digitised. I will use it to also run through vol. 10 of the Schwedenkiste (in Moscow) and then to give a better listing. I have private scans, and that is a messy business one can only do with the support of a database. --Olaf Simons (talk) 14:22, 12 August 2020 (CEST)
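The query linked in the exchange above is not preserved on this page. A sketch of a query of that kind, listing every Property:P165 statement that carries Property:P47 as a qualifier, could look as follows; the variable names are illustrative and the standard Wikibase prefixes are assumed to be predefined.

```sparql
SELECT ?item ?value ?place WHERE {
  ?item p:P165 ?statement .
  ?statement ps:P165 ?value ;
             # P47 used as a qualifier on the P165 statement
             pq:P47 ?place .
}
LIMIT 500
```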