FactGrid:Data modeling


The problem of on-and-off existence

Gotha's Lodge Item:Q10575 is a typical example: it has a starting point (Property:P49) but actually several new starts and more than one end (Property:P50). People found a lodge; the lodge is closed (in the events of the French Revolution); it re-opens in the national wave of the Napoleonic wars; it is closed again in 1935 and re-opened in 1949, or later, or never.

Is it always the same lodge? What do we do with changing names? The German National Library creates loads of items under ever new names - and you have to make sure to get the respective successions. The alternative: If those involved wanted to state that they actually are the same people (even if that is only a fiction of continuity) let them:

The present solution: We have started to use Property:P137 "History" and Item:Q94446 "active phase":

See Item:Q10575 Lodge "Ernst zum Compaß", Gotha

We can now use qualifiers on each active phase in order to state a respective beginning and a respective end. Is this a good option? How does it work in SPARQL searches if you want to give a time line? How does the option agree with the general P49/P50 use on items?
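A minimal Python sketch of the idea, outside Wikibase: each active phase carries its own begin and end, and a timeline question becomes an overlap test. The class name, the function names and all years are assumptions made for illustration; the years are placeholders and not the dates of any real lodge.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ActivePhase:
    """One P137 "History" statement with the value Q94446 "active phase"."""
    begin: int            # P49 qualifier (year), placeholder value
    end: Optional[int]    # P50 qualifier (year); None means no end recorded


# Placeholder years for illustration only.
phases = [
    ActivePhase(1700, 1720),
    ActivePhase(1750, 1800),
    ActivePhase(1850, None),
]


def phases_between(phases, start, stop):
    """Return the phases that overlap the closed interval [start, stop]."""
    return [
        p for p in phases
        if p.begin <= stop and (p.end is None or p.end >= start)
    ]


def overall_span(phases):
    """Earliest begin and latest end, comparable to item-level P49/P50."""
    first = min(p.begin for p in phases)
    ends = [p.end for p in phases]
    last = None if None in ends else max(ends)
    return first, last
```

In this model the item-level P49/P50 can be derived from the phases, which suggests the two ways of stating dates need not contradict each other.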

--Timeline is fine. Try for example Gotha's Lodge between 1760 and 1830. No problem either with the general P49/P50 (you can use OPTIONAL for the history) --Bruno Belhoste (talk) 18:46, 14 May 2020 (CEST)

Beautiful. You know how to script these things. (And you should write a blog post one of these days about practical tips - like how to get OpenRefine connected or how to best learn scripting these things...) --22:18, 14 May 2020 (CEST)
Just tested it with all the lodges that have these markers. Turns out that the search does not get the continuities. I added labels to show that.
The visualisation packs the histories without understanding the continuities. --Olaf Simons (talk) 07:24, 15 May 2020 (CEST)
Yes, there is a problem. I don't see any way to solve it: the Timeline module of the query service lays out the data so as to maximize compactness, and there is no way to constrain it to align data by any criterion. The only solution is to use another Timeline module with more features.--Bruno Belhoste (talk) 12:12, 15 May 2020 (CEST)

Changing names

An organisation can run through dozens of name changes over the years - e.g. an Early Modern publishing house with name changes whenever a father hands down the business to son, wife, son-in-law etc.

We are presently using Property:P57 for a history of naming, but that is not ideal since the usual searches will not get to the right names at the right moment.

It is a major question. I am not convinced by using Property:P57, for the reason you give. In my view, as historians, we have to give information that is as close as possible to the sources and to the actors themselves. That means we have to name organizations as they were named in the sources. It is possible that the same organization has two or three different names at the same time. We then have to choose one of its names as the reference name and put the others as aliases (and maybe also as objects of the property P57 naming). But, in my view, where the organization changed its name at certain times, we have to create different items. Take the example of the French Academy of Sciences. It was an Old Regime institution called Académie royale des sciences Item:Q153559, then during the Revolution it became the First Class of the National Institute Item:Q153578, and finally in 1816 again Académie des sciences Item:Q153579. Is it the same institution? Maybe yes, maybe no, but from the historian's point of view it is much better, I think, to consider them as different institutions, at least as a basic statement. The condition is that we link these different institutions into a chain, which will be the "timeless institution". I propose to use Property:P6 and Property:P7 to create this chain. Suppose you want to consider in a query not the Old Regime Academy Item:Q153559 but the Academy of Sciences in the longue durée. You can create the variable ?Academy in the WHERE clause with the triple: wd:Q153559 wdt:P7* ?Academy. If you use ?Academy in another clause you can retrieve information concerning the Academy of Sciences from its foundation in the 17th century to today; if you want to consider only the First Class of the Institute and the modern Academy of Sciences you can create the variable ?Academy in the WHERE clause with the triple: wd:Q153578 wdt:P7* ?Academy, and so on. Try [1] and [2].--Bruno Belhoste (talk) 14:22, 15 May 2020 (CEST)
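What the star in wdt:P7* does can be imitated in a few lines of Python. This is only a sketch: it assumes that P7 points from an item to its successor, as the query pattern in the comment implies, and the dictionary and function names are made up for the illustration.

```python
# P7 links between the three items named in the comment above.
# Assumption: P7 points from an institution to the one that follows it.
successor = {
    "Q153559": ["Q153578"],  # Académie royale des sciences -> First Class of the National Institute
    "Q153578": ["Q153579"],  # First Class of the National Institute -> Académie des sciences
}


def chain_from(start, links):
    """Items reachable from `start` in zero or more steps, like wdt:P7*."""
    seen = [start]
    queue = [start]
    while queue:
        current = queue.pop(0)
        for nxt in links.get(current, []):
            if nxt not in seen:      # guards against accidental loops in the data
                seen.append(nxt)
                queue.append(nxt)
    return seen


whole_academy = chain_from("Q153559", successor)
from_first_class = chain_from("Q153578", successor)
```

Because the star means "zero or more steps", the starting item is itself part of the result, which is why the Old Regime Academy appears in the first chain but not in the second.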



Ernst Howald and Henry E. Sigerist. Antonii Musa De herba vettonica ... Leipzig 1927

We created two properties for this: Property:P233 names the object - a book edition, a manuscript or any other thing that is genetically earlier. Property:P234 comes as the qualifier and states on what basis the item can be seen as following from that object. You might for instance link a translation to the edition that gave the original text.

The organisation is top down chronological (the guide lines in the picture above are not that beautiful, but dates on y-axis would be cool).

Objects can have multiple connections to earlier Items (a medieval scribe could use two books to create a new version of the text).

It would be cool if the P234 information became available on the lines that are connecting items.

Here too, one of the problems is: How do I select a family of items?
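One possible reading of "family" is: everything connected to a given item through P233 links, in either direction. The Python sketch below works under that assumption; the identifiers are invented and the function name is hypothetical.

```python
from collections import defaultdict

# Hypothetical P233 links: each key is a text witness, each value lists the
# earlier objects it derives from. All identifiers are made up.
derived_from = {
    "copy_C": ["manuscript_A", "manuscript_B"],  # a scribe using two books
    "edition_D": ["copy_C"],
    "translation_E": ["edition_D"],
    "unrelated_Y": ["unrelated_X"],
}


def family_of(item, derived_from):
    """All items connected to `item` by P233 links, ignoring direction."""
    neighbours = defaultdict(set)
    for later, earlier_items in derived_from.items():
        for earlier in earlier_items:
            neighbours[later].add(earlier)
            neighbours[earlier].add(later)
    family = {item}
    stack = [item]
    while stack:
        for other in neighbours[stack.pop()]:
            if other not in family:
                family.add(other)
                stack.append(other)
    return family
```

Starting from any member, for example "edition_D", the function returns the whole group of five related witnesses and leaves the unrelated pair out.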

The situation at a particular point in time

Think of a house: Tenants are moving in and out - we can model that with Property:P239 "resident" and P49/50 qualifiers. You are now interested in the situation at a point in history: Who were the tenants on March 3, 1848?

If we can get this done for a house like Item:Q14572 we might be able to show a city at a point in time.
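The point-in-time question boils down to a date filter over the qualifiers. A small Python sketch, with invented tenants and dates, and with the assumption that a missing end date means "still resident" (it could just as well mean "unknown"):

```python
from datetime import date

# Hypothetical P239 "resident" statements for one house, each with a P49
# begin and a P50 end qualifier. Names and dates are invented.
residents = [
    ("Tenant A", date(1840, 1, 1), date(1847, 12, 31)),
    ("Tenant B", date(1845, 6, 1), date(1850, 5, 31)),
    ("Tenant C", date(1848, 3, 3), None),            # no end date recorded
    ("Tenant D", date(1848, 4, 1), date(1849, 1, 1)),
]


def residents_on(day, statements):
    """Names of everyone whose residency interval contains `day`."""
    return [
        name for name, begin, end in statements
        if begin <= day and (end is None or day <= end)
    ]


tenants_1848 = residents_on(date(1848, 3, 3), residents)
```

Running the same filter over every house of a street or town would give the snapshot of a city at that date.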

Organisational ties

Originally we thought we should use the Wikibase advantage of being able to create any imaginable statement to do just that. A lodge has a "Mother lodge", so do create a link to that (with the respective daughter lodge property). It can choose an umbrella organisation, it will accept an obedience (adhering to a system) etc.

The problem with these specific properties is

  • that a neighbouring organisation might have its own nomenclature for pretty much the same dependencies - or that it can have the same terms but mean something different by them,
  • that you need to know the specific terminology in order to run a query.

The general solution could be a standard property like "organisational ties", with qualifiers to specify the particular tie.

A more specific solution can lie in between these poles: We create a pattern of general types of these ties, so that we can then ask broad questions in order to see different networks.

  1. is owned by / owns
  2. parent organisation / subsidiaries
  3. received the patent from / granted patents to
  4. represented by / representing
  5. member of / members
  6. partner organisations
  7. organisationally supported by / supporting with organisational help
  8. financed by / finances
  9. recognised by / recognises
  10. next hierarchical level above / next hierarchical level underneath

One would now use qualifiers to give the exact terminologies. Which of these are redundant? Which of them are missing?
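To show how such a pattern could be queried, here is a small Python sketch: the general tie type is the thing you search for, the exact term travels along as a qualifier. All organisations, the mapping of terms to tie types and the function name are hypothetical.

```python
# General tie types taken from the list above; the specific wording used by
# each organisation is kept as a qualifier. All statements are invented.
ties = [
    {"subject": "Lodge X", "tie": "parent organisation",
     "object": "Lodge Y", "term": "mother lodge"},
    {"subject": "Firm P", "tie": "parent organisation",
     "object": "Firm Q", "term": "holding company"},
    {"subject": "Lodge X", "tie": "member of",
     "object": "Association Z", "term": "umbrella organisation"},
]


def network(tie_type, statements):
    """All (subject, object, specific term) triples of one general tie type."""
    return [
        (s["subject"], s["object"], s["term"])
        for s in statements
        if s["tie"] == tie_type
    ]


parents = network("parent organisation", ties)
```

The broad question ("who depends on whom?") works without knowing that one organisation says "mother lodge" and the other "holding company", yet both terms remain available in the result.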

Is there a better option?

Data and Metadata

It is important to distinguish carefully between data and metadata. A datum is a specific entity; metadata is data that provides information about other data, here also called a class. An entity is defined as data by P2 (instance of) and as metadata (a class) by P3 (subclass of). P2 and P3 are exclusive: an entity cannot be at the same time an "instance of" and a "subclass of" the same item.

Defining classes (and subclasses) improves queries tremendously.

For example: "University of Erfurt" (Q11263) is "an instance of" (P2) "university" (Q11307), which is "a subclass of" (P3) "higher education institution" (Q144732), which is "a subclass of" (P3) "education institution" (Q160273), which is "a subclass of" (P3) "organisation" (Q12).

If you want to get only the universities (including the University of Erfurt), you make the simple query:

?universities wdt:P2 wd:Q11307

If you want to get all the education institutions (including the universities, and especially the University of Erfurt), you make a property-path query (using the slash /):

?educationInstitution wdt:P2/wdt:P3* wd:Q160273.

The star * means that you go through all the subclasses of wd:Q160273.
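The path P2/P3* can be read as "one instance-of step, then any number of subclass-of steps". The Python sketch below walks the example hierarchy by hand; it assumes every class has a single superclass, which is a simplification, and the function names are made up.

```python
# P2 and P3 links from the University of Erfurt example above.
instance_of = {"Q11263": "Q11307"}   # University of Erfurt -> university
subclass_of = {
    "Q11307": "Q144732",    # university -> higher education institution
    "Q144732": "Q160273",   # higher education institution -> education institution
    "Q160273": "Q12",       # education institution -> organisation
}


def classes_of(item):
    """The direct class of `item` plus all its superclasses (P2/P3*)."""
    result = []
    current = instance_of.get(item)
    while current is not None and current not in result:
        result.append(current)
        current = subclass_of.get(current)
    return result


def is_a(item, cls):
    """True if `item` is an instance of `cls` or of one of its subclasses."""
    return cls in classes_of(item)
```

Note that a class such as "university" (Q11307) is not itself returned as an education institution here: it has no P2 link, only P3 links, which mirrors the distinction between data and metadata made above.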

In conclusion, use P3 "subclass of" and not P2 "instance of" when you create an item that is a class. Quite often the superclass of this subclass does not exist yet and you have to create it at the same time. It is a bottom-up way to develop the ontology of FactGrid. --Bruno Belhoste (talk) 15:55, 20 May 2020 (CEST)

A cohesive data model for people (and careers)

We will need a more cohesive data model for people, especially to note positions, employments, offices held.

I started with my own CV as I felt I could handle this without deeper recourse to complex data models

  • Olaf Simons - a CV with modern information about employments

Our Illuminati biographies required greater attention to membership and offices or ranks these people held:

  • Christian Georg von Helmolt - a CV with extensive information about the (Masonic) offices held by von Helmolt; note here the combination of P266 offices held with qualifier P267 organisational context.

The third wave of biographies derived from the Thüringer Pfarrerbücher wanted to be synchronised with information that appeared on the different pastorates. Note in both Items the use of positions and

Barbara Kröger and Christian Popp of the Germania Sacra project proposed this arrangement for

  • Heinrich Belitz - note here the use of the career statement and the use of P91 (membership) as a qualifier.

My proposal would be to use P91 in the first level triples, as they give us valuable information about the organisational networking. If we bring Belitz's life into the Christian Georg von Helmolt pattern we will create a CV like

It is essentially irrelevant how we do it - and to some extent merely a question of transformations that will be more or less easy. We would have to transform some 2000 CVs of pastors to state that they were pastors in several places. The ugly part is that much of this would become manual work: Wikibase does not like a QuickStatement input with repetitive statements: If a person is a pastor in 4 different positions the automatic input will state once that he was a pastor, it will then list the positions - and, terrible, all the begin dates and all the end dates in two separate heaps of no further use.

It is clear on the other hand why one might not like to state that person X was employed at Abbey Y in the position of an abbot.

We can pragmatically allow all these statements side by side. The software is just collecting triples and none of these triples is wrong in itself. The problem is here basically that the same searches will not work throughout the database's range - which will become nasty if ever we manage to run a simple search interface with a few standard input fields on the database.

A simple way to set a compromise is to slightly rename certain Properties to make them work from the 12th century abbot to the 21st century employee.

A side remark on the career statement option: I have used this Property with the hope to win a specialist on professions who would use her own database to revise and differentiate this mess with her own interest in jobs. That is basically why I hesitated to turn that very open statement into a base. (The second reason I mentioned above: the automatic input would produce a lot of work once we began to use repetitive statements of the same with ever changing qualifiers.)

An exemplary CV to serve as a model for the lives I have just mentioned would be welcome. --Olaf Simons (talk)

Bruno Belhoste

I think it is important that all FactGrid projects adopt the same basic model for the actors (individuals but also institutions) to make the queries easier. Obviously, each project can propose extensions to this basic model according to its needs.

Career modeling is a delicate point. As Olaf explains, it is exactly the same to state the offices held by a person and then, as qualifiers, the institutions concerned, or to state the institutions where the person is active and then, as qualifiers, the offices held. However, it is a major change for queries. Since Olaf opted for the first solution, it is clear that all projects should adopt it.
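The point that both arrangements carry the same information but answer different queries can be shown in a few lines of Python. The sketch uses the "person X, abbot, Abbey Y" case mentioned earlier; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    person: str
    main_value: str   # value of the main statement
    qualifier: str    # value of the qualifier on that statement


def mirror(st):
    """Swap main value and qualifier: office-first <-> institution-first."""
    return Statement(st.person, st.qualifier, st.main_value)


def find_by_main_value(statements, value):
    """A query that only looks at main values, as a simple search form would."""
    return [st.person for st in statements if st.main_value == value]


office_first = Statement("Person X", "abbot", "Abbey Y")
institution_first = mirror(office_first)
```

Mirroring twice gives back the original statement, so nothing is lost in the transformation. A search for "abbot" as main value, however, finds Person X only in the office-first form, which is why one convention has to hold across all projects.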

Olaf points out a serious problem which occurs in all cases: when you make an input with QuickStatements, you cannot repeat the same statement twice with varying qualifiers, because these qualifiers are then attributed to both statements indiscriminately. For example, Georg Eckolt (Q41800) is at Pastorate Gräfentonna first as a deacon and then as a pastor, but the start and end dates are not distinguished.

The data can be entered by hand, but this can be very tedious. In my opinion, this is not the best solution. In fact, I think that there is a problem of data modeling. For each person, each occupation must be unique. This means, for example, that Georg Eckolt is not a pastor twice, first in Emleben and then in Gräfentonna; he is a pastor (general declaration, which is not mandatory), a pastor in Emleben (second declaration) and a pastor in Gräfentonna (third declaration). Two items must therefore be created: "pastor in Emleben" and "pastor in Gräfentonna", which are two instances of the item "pastor" (which is a class) with the locations Emleben and Gräfentonna respectively (by the way, notice that Friedrich Schösser (Q41748) is also "pastor in Gräfentonna"). This would solve all the problems with qualifiers. --Bruno Belhoste (talk) 18:15, 23 May 2020 (CEST)
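A toy Python model of the import behaviour described above may make the difference visible. It is not QuickStatements or Wikibase code; it only imitates the observed effect that rows with the same item, property and value end up as one statement. The property label "occupation" and the date placeholders are assumptions.

```python
from collections import defaultdict


def merge_rows(rows):
    """Rows with the same (item, property, value) collapse into one
    statement; their begin and end qualifiers pile up in two heaps."""
    statements = defaultdict(lambda: {"begin": [], "end": []})
    for item, prop, value, begin, end in rows:
        key = (item, prop, value)
        statements[key]["begin"].append(begin)
        statements[key]["end"].append(end)
    return dict(statements)


# "start-1", "end-1" etc. are placeholders, not real dates.
rows_generic = [
    ("Q41800", "occupation", "pastor", "start-1", "end-1"),
    ("Q41800", "occupation", "pastor", "start-2", "end-2"),
]
rows_specific = [
    ("Q41800", "occupation", "pastor in Emleben", "start-1", "end-1"),
    ("Q41800", "occupation", "pastor in Gräfentonna", "start-2", "end-2"),
]

merged_generic = merge_rows(rows_generic)
merged_specific = merge_rows(rows_specific)
```

With the generic value there is one statement holding two begin dates and two end dates that can no longer be paired; with the two specific items each statement keeps exactly one begin and one end.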