FactGrid talk:PhiloBiblon




1st Web-Meeting, 18 May 2021

Here are the notes from the meeting of 18 May 2021:

PhiloBiblon Meeting Notes

Dear Colleagues,

Here's what we have in the Workplan for this summer:

Objectives:
  • Review of Wikibase software: standards, protocols, data formats, and implementations; comparison with PhiloBiblon schema and data dictionaries and current and desired functionalities.
  • Scenarios to describe how contributors and users will interact with PhiloBiblon.
  • Functional specifications for the features required for those scenarios.
  • Identify a small set of particularly rich and related records (number TBD) from each of PhiloBiblon’s four databases (BETA, BIPA, BITAGAP, BITECA) and ten tables to serve as test cases.
  • Ongoing data clean-up in legacy PhiloBiblon.

Tasks:
  • T1. Review Wikibase software: standards, protocols, data formats, implementations (PI, Anderson, Formentí, Simons)
  • T2. Develop user scenarios based on PhiloBiblon data and current and desired functionalities (PI, academic staff, Adv. Board: Dagenais, Gullo)
  • T3. Create functional specification for features needed to re-create these functionalities (PI, Anderson, Formentí)
  • T4. Identify set of test target records (PI, academic staff)
  • T5. Ongoing clean-up of legacy data (PI, academic staff)

5/18/21 Meeting Notes

Attendance:
  • Adam Anderson (Zoom host), data analyst for the project, Berkeley lecturer
  • Charles Faulhaber, project PI and (new) director of the Bancroft Library
  • Josep Formentí, software engineer in Barcelona, also knows NLP and web development
  • Olaf Simons, book historian at the University of Erfurt, Gotha; Wikimedia platform & FactGrid
  • Randal Brandt, head of cataloging at UCB, rare books
  • Xavier Agenjo, director of projects for the Fundación Ignacio Larramendi in Madrid, creator of more than 40 digital libraries
  • Daniel Gullo (dgullo@csbsju.edu), director of collections, cataloging, creating databases of libraries, modern digital collections, controlled vocabulary for underrepresented religious traditions in the Middle Ages (vHMML online database, NEH project director)
  • Óscar Perea Rodríguez, lecturer at USF, working with Charles on this since 2002
  • Jason Kovari (cataloging rare books), Cornell
  • Robert Sanderson, director of digital collections at Yale
  • Cliff Lynch, director of the Coalition for Networked Information, a small nonprofit in DC; in the School of Information at UC Berkeley; worked on the predecessor of the CDL
  • John May (phone), software developer, information management systems, designer of the PhiloBiblon software
  • John Dagenais, Professor of Spanish at UCLA, degree in library science, user of PhiloBiblon

PhiloBiblon Project: Goes back to 1975, as a spin-off of the Dictionary of the Old Spanish Language project at the University of Wisconsin-Madison, a Spanish version of the OED based on contemporary uses of the language. To do this they created an in-house database (1975 onward): the Bibliography of Old Spanish Texts (BOOST), lineal ancestor of PhiloBiblon. Constant technology change: the collection went onto CD-ROM discs by 1992, with digital images + text; 1994 brought the internet (Netscape). Charles was teaching a course on DH computing (including gopher, OCR, etc.). One of his students introduced him to the World Wide Web and he said that it wasn’t going to be important… Since then they’ve been working on keeping up with the different versions (1.0, 2.0, 3.0 = Linked Open Data with RDF). Currently PhiloBiblon exports data from the Windows program into XML files uploaded to the server at Berkeley, where XTF (eXtensible Text Framework), run by the CDL, takes large XML files from 9 PhiloBiblon tables (uniform title, manuscripts/editions, persons, etc.) and parses each of these files into individual records for querying. Objective: to get us aligned with Wikibase / the Wikimedia Foundation, to piggyback on their data and technology moving forward.

Olaf Simons: FactGrid (works on Masonic institutions). Charles found Olaf through commentary on a blog post. Wikibase is the software behind the Wikidata project, under development since 2012 and used by national libraries worldwide to control data: it stores triple-based statements--two entities linked by a relation--which can themselves be annotated and used to develop further metadata. You can see who is editing it by the minute. Each entity has a Q-number; for each person, e.g., you can add as many statements as you want (gender, address, field of research, etc.), and you can qualify these statements. You can add as many references to an entry as you want. Used as a source for anything imaginable: we collect statements that become entries. Using SPARQL you can query this, e.g. for the Illuminati members list, ask for date of birth and it appears as a column for each member. SPARQL does not query free-text input fields; it normally connects items to items. E.g. a person is a member of a lodge, which is another item containing information with its own statements. Each item is a database object. We will have to translate input-field information into objects, and for that to work each object needs statements. First point: we don’t deal with books as such; instead you have people, places and institutions connected to these books. All of these need statements to work. Produce a network of items and relations. Consider the types of objects (e.g. geographic names, proper nouns, etc.). We’ll need to get an idea of the number of items you’ll be creating beforehand. You should create your own team and get accounts for its members, along with managers of the accounts, to run the team independently. Charles: for Spanish texts, 6K texts and 8K individuals. PhiloBiblon “data clips” correspond to P properties in Wikibase.
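To make the query example concrete, here is a minimal sketch, assuming FactGrid exposes the standard Wikibase query service at https://database.factgrid.de/sparql with the usual entity/property URI layout; the group item Q100 and the properties P91 ("member of") and P77 ("date of birth") are placeholders, not real FactGrid numbers.

import json, urllib.parse, urllib.request

# Placeholder IDs: Q100 = the group (e.g. the Illuminati item), P91 = "member of",
# P77 = "date of birth". Look up the real FactGrid numbers before running.
query = """
PREFIX fg:  <https://database.factgrid.de/entity/>
PREFIX fgt: <https://database.factgrid.de/prop/direct/>
SELECT ?member ?memberLabel ?born WHERE {
  ?member fgt:P91 fg:Q100 .
  OPTIONAL { ?member fgt:P77 ?born }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""
url = "https://database.factgrid.de/sparql?" + urllib.parse.urlencode({"query": query, "format": "json"})
req = urllib.request.Request(url, headers={"User-Agent": "PhiloBiblon-sparql-sketch/0.1"})
with urllib.request.urlopen(req) as resp:
    bindings = json.load(resp)["results"]["bindings"]
for b in bindings:
    # the date of birth appears as one more column per member, exactly as described above
    print(b["memberLabel"]["value"], b.get("born", {}).get("value", ""))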

Creating properties is not the issue. You create properties as needed and link them on a text-by-text basis.

The real problem is to understand the types of objects you have to create, and how they’re interconnected. They need to be created before you can link them.

E.g. often we start with places and move from there. First you create the objects; then you interconnect them.

Usually projects come to us with a spreadsheet…

Foreseeable Problems: The entire project is multilingual (using P-numbers and Q-numbers, cf. the Deutsche Nationalbibliothek's GND), which allows for different labels on these items and properties. You should use it in the language of your sources. The database currently accommodates French, English and German (the main users of the database), who put their labels in the database. You will need to add labels in Spanish and Portuguese. Programming issues: the software is not designed to create presentable projects at the moment. This is ongoing work: 1) the FactGrid viewer, which creates pages on the fly from information in the database. You will want something similar for your own data. Bruno Belhoste is its creator; you can contact him. This is where Wikibase is currently underdeveloped. 2) A ‘Knowledge’ tool is in development which can also show the metadata for each entity. Otherwise you get the SPARQL query functionality.

Robert Sanderson to Everyone (12:32 PM): We used XTF at Getty for our archives: https://xtf.cdlib.org/ For what it’s worth, at Yale we’re doing the same process across the libraries, archives and museums. The current data is 45 million such entities, spanning 2.5 billion triples. The difficulty is understanding the modeling and doing the transformation; the database side is essentially free. We have 15M MARC records, which get turned into 45M entities: people, places, concepts, objects, images, digital things (8 classes). We’re not using Wikibase, but rather the main library standards by themselves. The difference between the data modeling paradigms will necessitate the creation of new identifiers. But Wikibase is really good at managing external, typed identifiers :) We should write down the things you want to refer to, e.g. ‘ownership’ (provenance in the museum and library world), if that’s something you want to refer to independently of the object. From there you can work out the CSV tables and get them into the system. Not to be a broken record, but the modeling also determines the possibilities for the queries. If you have (for example) place-written and date-written on a text that can have multiple authors writing at different times and in different places, then you need some way to connect author / place and time *in the model*.
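A sketch of that last point in Wikibase terms: place and date attach as qualifiers to each individual authorship statement rather than to the text item as a whole, so every author keeps their own place and time. The property numbers below (P50 author, P276 place written, P585 date written) are Wikidata-style placeholders, not actual PhiloBiblon or FactGrid properties.

# Two authorship statements on the same text item Q1001, each with its own qualifiers
# (QuickStatements V1 style: item, property, value, then qualifier property/value pairs).
rows = [
    ["Q1001", "P50", "Q2001", "P276", "Q3001", "P585", "+1250-00-00T00:00:00Z/9"],
    ["Q1001", "P50", "Q2002", "P276", "Q3002", "P585", "+1290-00-00T00:00:00Z/9"],
]
for row in rows:
    print("\t".join(row))  # paste the printed lines into the QuickStatements input field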

Daniel Gullo: Biblissima is using this model (https://data.biblissima.fr/w/Accueil). A question about Q-numbers… Warning: scholars use the MS IDs (manid), text IDs (texid), and other IDs established in the PhiloBiblon system.

Olaf: You will create external IDs (vital for places, e.g.) for any item from PhiloBiblon, and they will be linked to the PhiloBiblon IDs. Q-numbers that turn out to be double entries are merged (there is movement; they are not deleted, but merged). There are ways of creating more key-value pairs without deleting IDs.

This is similar to the database we’re developing… When you’re migrating your data, you want to think of it hierarchically with a controlled vocabulary: 1) Continents, 2) Geographic names, 3) institutions, 4) other entities.

Jason Kovari Is the plan to assess what users really want? (or move forward with what you have?) Is part of the workplan assessing the work that you initially have? Jason Kovari (he/him) to Everyone (12:50 PM) The word I was forgetting earlier was Affinity. There is a Wikidata Affinity Group as part of the LD4 Community: https://www.wikidata.org/wiki/Wikidata:WikiProject_LD4_Wikidata_Affinity_Group

Charles: There’s an orthogonal relationship between the ‘library world’ and what we’re doing. PhiloBiblon has evolved over 35 years in response to the needs of the user community as represented primarily by the members of the four PhiloBiblon teams (for Spanish, Portuguese/Galician-Portuguese, Catalan, Golden Age poetry). To what extent do we need to accommodate what the library world has been doing? We’re not reimagining PhiloBiblon, but rather trying to move it into the web-based wiki world. We want to incorporate external authority files, such as the Virtual International Authority File (VIAF), the Getty thesauri, and Denis Muzerelle's Vocabulaire codicologique. Rob and Olaf: Suggest starting with a small number of records (e.g. 2 or 3 codices, based on your needs). This has been done. Choose complex items (10 of each sort) and go from there, e.g. items for each of: books (with Q-numbers for all the books), toponyms, institutions, etc.

What’s the process of getting PhiloBiblon into Wikibase? Use QuickStatements (or the API) to feed a spreadsheet into the machine. It takes a few hours to feed the data in, but it’s simple. The benefit of a CSV import is that you can use OpenRefine to augment the identifiers and reconcile data. How many items are on the largest spreadsheet?
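A rough sketch of that spreadsheet-to-QuickStatements route, not the project's actual workflow: it reads a reviewed CSV of places and prints a Version 1 batch. The column names, the description text and the property numbers P100 (GeoNames ID) and P101 (coordinates) are all placeholders to be replaced by whatever FactGrid actually uses; OpenRefine reconciliation would happen on the CSV before this step.

import csv

# places.csv is assumed to have the columns: name, geonames_id, lat, lon
with open("places.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        print("CREATE")
        print("\t".join(["LAST", "Les", f'"{row["name"]}"']))                    # Spanish label
        print("\t".join(["LAST", "Des", '"Lugar (importación PhiloBiblon)"']))   # placeholder description
        print("\t".join(["LAST", "P100", f'"{row["geonames_id"]}"']))            # GeoNames ID (placeholder P)
        print("\t".join(["LAST", "P101", f'@{row["lat"]}/{row["lon"]}']))        # coordinates (placeholder P)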

John May: (project janitor, coder) performs rectifications and homogenization in the Windows database. It’s straightforward to export in CSV / spreadsheet form, but the files are exceptionally large. The creation of Q and P numbers can be done programmatically, but the ontology will need to be made clear. Design desiderata: to accommodate ambiguity and uncertainty, so PhiloBiblon is made to be flexible and customizable. He’ll await the orders from Olaf as to how he wants the data. The PhiloBiblon DBMS is an n-dimensional dynamic data model (which can be normalized as needed), ca. 60 MB for each bibliography (spreadsheet); no fixed field lengths; any structure / sub-structure can contain further structure - it’s all based on arrays. 6K texts, 14K witnesses. John can write the code and tag the texts with the P / Q numbers that correspond to Wikibase. PhiloBiblon is not a standard data model, so it will take some time to get the CSV…
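A toy sketch of the flattening step John describes, assuming, purely for illustration, that one exported record arrives as nested lists and dictionaries; the real PhiloBiblon field names, IDs and structure will differ.

import csv, sys

# Hypothetical nested record: one manuscript whose witnesses and copyists are nested arrays.
record = {
    "manid": 9999,
    "witnesses": [
        {"texid": 1111, "folios": "1r-24v", "copyists": [{"bioid": 501}, {"bioid": 502}]},
        {"texid": 2222, "folios": "25r-60v", "copyists": []},
    ],
}

# Normalize to one row per (manuscript, witness, copyist) so it fits a spreadsheet / CSV.
writer = csv.writer(sys.stdout)
writer.writerow(["manid", "texid", "folios", "copyist_bioid"])
for wit in record["witnesses"]:
    for cop in (wit["copyists"] or [{"bioid": ""}]):
        writer.writerow([record["manid"], wit["texid"], wit["folios"], cop["bioid"]])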

How is PhiloBiblon used currently? Charles: the primary uses of PhiloBiblon at present:

For editors of texts: How many MSS / printed editions are there of a given text? Descriptions of MSS. There are lots of other potential queries based on data in PhiloBiblon, but it is currently difficult to extract, e.g.:

Codicology: How many MSS have gatherings of 12 leaves, and where do these MSS come from? Where and when are particular catchword or signature types used in gatherings?

Prosopography: What individuals were active in a given location in a given time period?

Óscar: there’s not a lot of interaction with PhiloBiblon, although some users do download forms from our Collaborate page, fill them out, and return them for incorporation into PhiloBiblon. Xavier: demo of SPARQL in a nice web editor from the Biblioteca Dixital Galiciana… (Olaf was impressed with the editor)

Olaf: FactGrid is a limited-user database, so only users with accounts can change things. Everyone in the project will get an account. Users like the ones Charles describes can be given an account. There’s no central form for filing ‘issues’; you just change things when you see a mistake. People who make a change add a reference about the change. You can keep mistaken information in the database to make sure people don’t change it back: it remains in the database, ‘downgraded’ as mistaken information. Qualifiers are created for hypothetical statements. The public can see everything and run any search, but editing is not open to the public. Once you have your data in qualified triples, it will be used by different users depending on their own interests.

Charles: We want to facilitate crowd-sourcing in order to tap into Hispanists all over Europe who can describe MSS in their local libraries and add that information to PhiloBiblon directly.

Olaf: “spread accounts” - We can do this by giving everyone interested in contributing an account. Wikidata doesn’t have working hypotheses; they want facts, not theories. FactGrid, by contrast, allows for marking the type of knowledge (hypothetical, guess, needs to be substantiated). You can exercise editorial control / block items… You can put items on a watch list and check who edits them. Otherwise everyone is allowed to edit, and we watch the items as they get edited. Olaf will give people accounts (based on email addresses, website / institutional titles), but he can also give admin accounts and show you how to create these accounts…

Charles: Would it be useful to add a field in PhiloBiblon for a Wikibase Q / P number (as we correct our data)?

Yes, it’s advisable. FactGrid will show these, as it does for Wikidata numbers. This will be an interactive process, back and forth, until we get it right…

John May: In the production of these spreadsheets, what are we trying to do? E.g. there are 200 cells (with structure within, including P-numbers and Q-numbers). In a normalized table we would break these up, but they are not in that format yet. How are P / Q numbers assigned?

Tasks:
  • Get familiar with FactGrid & Wikibase. Adam will be working on linking the entities between PhiloBiblon and Wikibase.
  • Get an admin account from Olaf (so you can create accounts for the project).
  • Take the framework and CSV which John May will make and work with Olaf, Josep, and Jason.
  • Once we have the CSV, contact Jason Kovari: the question is whether we need entities or just metadata for the text.
  • Josep will work on obtaining the entities from the texts (see interface).
  • Josep will work on making an API for easy input options.
  • Rob: We should write down the primary things you want to refer to, e.g. ‘ownership’ (provenance in the museum and library world), if that’s something you want to refer to independently of the object. From there you can work out the CSV tables and get them into the system.
  • Everyone: Communicate all problems on the project page, so anyone can see the discussion that led to certain solutions (so our models get adopted). Transparency is key for others to follow your line of thinking.

Olaf's thoughts

My questions - what kind of objects are we dealing with, and in what quantities? - did not aim at a data model which we would need before we start. We are not working on a conventional database where you have to know the categories of objects before you start. (We have only one sort of object - items with Q-numbers - and they attract statements without any need to create conceptual borders.)

My question was intended to detect hidden fields. Some databases have texts and people - and the organisers do not realise that they also have places - simply because they run these places as text input fields. Places will be items as well, and they are tricky. FactGrid has, for example, 11 items under the name "Paris". 10 of these are places, one is a family name. If you connect manuscripts or people to "Paris", this will be easy to match as long as you already have your Paris with an external identifier such as the GeoNames ID. If places, however, are so far just referenced as strings, it will be work to identify the respective Q-items on our site. We have only some 20 major Iberian places on FactGrid at this moment. You will create your own references - and I assume that you will prefer to feed c. 30,000 Iberian places into FactGrid rather than create the c. 8,000 places you would need right now for the 8,000 people. The import of the whole package (from a good source with Wikidata numbers, GeoNames IDs and standard references of your choice plus coordinates) saves a lot of work later whenever you link information to a new place.
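One possible way to assemble such a package for joint review, sketched against the public Wikidata query service; the class and property IDs (human settlement, country, GeoNames ID, coordinates) are real Wikidata identifiers, but the filtering and the LIMIT are only illustrative, and coverage would need checking before any import.

import csv, json, sys, urllib.parse, urllib.request

# Human settlements (Q486972) in Spain (Q29) or Portugal (Q45) with GeoNames ID (P1566)
# and coordinates (P625), labelled in Spanish, Portuguese or English.
query = """
SELECT ?place ?placeLabel ?geonames ?coord WHERE {
  ?place wdt:P31/wdt:P279* wd:Q486972 ;
         wdt:P17 ?country .
  VALUES ?country { wd:Q29 wd:Q45 }
  OPTIONAL { ?place wdt:P1566 ?geonames }
  OPTIONAL { ?place wdt:P625 ?coord }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "es,pt,en". }
}
LIMIT 500
"""
url = "https://query.wikidata.org/sparql?" + urllib.parse.urlencode({"query": query, "format": "json"})
req = urllib.request.Request(url, headers={"User-Agent": "PhiloBiblon-places-sketch/0.1"})
with urllib.request.urlopen(req) as resp:
    bindings = json.load(resp)["results"]["bindings"]

# Write a CSV that can be looked over together before anything is fed into FactGrid.
writer = csv.writer(sys.stdout)
writer.writerow(["wikidata", "label", "geonames_id", "coordinates"])
for b in bindings:
    writer.writerow([
        b["place"]["value"].rsplit("/", 1)[-1],
        b.get("placeLabel", {}).get("value", ""),
        b.get("geonames", {}).get("value", ""),
        b.get("coord", {}).get("value", ""),
    ])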

My present tally is here:

  • 6,000 Codices
  • 14,000 Witnesses
  • 8,000 Individuals
  • 30,000 Places (just a guess after checking Wikidata on villages and human settlements)
  • ..... Organisations (Libraries etc.)

...all these are quantities we can easily prepare on spreadsheets with the advantage that we can take a look at these data together before we feed them into the machine. The places are a good starting point as they will be just the landscape to refer to, not the complex objects themselves. --Olaf Simons (talk) 14:17, 20 May 2021 (CEST)

PhiloBiblon statistics

I tried to set this up in spreadsheet format, but couldn't do it. I've sent you the spreadsheet via e-mail.

We currently have a total of 421,293 records. If we do as you suggest, e.g., ingesting existing sets of toponyms and other relevant vocabularies, the number will be much higher. --Charles Faulhaber (talk) 01:18, 21 May 2021 (CEST)

Dan Gullo's suggestions (17 May 2021) for MSS to use as test objects

When choosing records, I would suggest some basic criteria (I put this here before my suggestions).

  • Well-established authors that are found in multiple collections
  • Well-established authors that are found in multiple languages
  • Well-established authors that may be found outside of Spanish or Portuguese libraries
  • Complex author-title relationships, that may involve a known translator and a known author
  • Complex title-title relationships, perhaps a commentary on a known work
  • Complex titles with multiple variants, perhaps by language
  • Works that are in print and manuscript
  • Works with established bibliography

I would suggest Gonzalo de Berceo and known works to deal with a major author

  • BETA bioid 1211

I would suggest Isaac of Nineveh and the translations and printed editions of his works (serious issues to wrestle with here because of the need to reconcile your data so that one author moves from 3 IDs in PhiloBiblon to one Q-number in Wikibase)

  • BETA bioid 1186
  • BETA bioid 1398
  • BITAGAP bioid 1079

I would suggest Pseudo-Seneca for the same reasons as Isaac of Nineveh

  • BETA bioid 1192
  • BITAGAP bioid 1107
  • BITECA bioid 6379

For particularly interesting records to think about the complexity of data migration, here are three.

Very complex record

  • BETA manid 1567

A record which is not a complete manuscript, raising the question of how you want to represent a part of a manuscript rather than a complete one.

  • BITECA manid 2554

For a printed book with complex data

  • BETA manid 1510

For institutions, you can use mine for fun, because you have it three times in PhiloBiblon, with three different names.

  • BITECA libid 1128
  • BITAGAP libid 852
  • BETA libid 668

Test Items created in 2019

Do links to PhiloBiblon have to be dynamic?

I am asking this because I tried to create an External identifier from FG to PB.

The external identifier could lead directly into the data set if I had a link with a stable pattern and just the ID changing. I would replace the ID position in the URL with $1 and be able to link from the number into the PB item page... It seems that does not work, as I see you do not give any links into your own database...

Just by the way: sign all comments with ~~~~ - that creates automatic and dated signatures when you press save. --Olaf Simons (talk) 22:55, 19 May 2021 (CEST)

PhiloBiblon URLs are created dynamically, but they are stable. We link to them all the time from external web pages, such as in our blog. Typically we hide the URL strings within / underneath the text description. I note that when I copy such items to Wikidata the underlying link disappears and has to be added as a real URL. Thus BETA manid 1106 (Toledo: Biblioteca Capitular, 43-13 (https://pb.lib.berkeley.edu/xtf/servlet/org.cdlib.xtf.dynaXML.DynaXML?source=BETA/Display/1106BETA.MsEd.xml&style=MsEd.xsl%0A%0A%20%0A%20%0A%20&gobk=http%3A%2F%2Fpb.lib.berkeley.edu%2Fxtf%2Fservlet%2Forg.cdlib.xtf.crossQuery.CrossQuery%3Frmode%3Dphilobeta%26mstype%3DM%26everyone%3D%26city%3D%26library%3D%26shelfmark%3D13+43%26daterange%3D%26placeofprod%3D%26scribe%3D%26publisher%3D%26prevowner%3D%26assocname%3D%26subject%3D%26text-join%3Dand%26browseout%3Dmsed%26sort%3Dtitle)). Charles Faulhaber (talk) 22:16, 20 May 2021 (CEST)

PhiloBiblon Project Page in multiple languages

I would like to add the Spanish, Catalan, and Portuguese versions of this basic description of PhiloBiblon. 21:35, 20 May 2021 (CEST)


---

Two ways. Option 1: You log in in the language which you want to add, and then you will also see the Properties that should be translated as well.

Option 2: You use the QuickStatements input. Make your Statements in Excel or in a Google spreadsheet and then paste them into the QuickStatements field and use Version 1 input. This is the content for the three columns of a basic triple Statement that sets a Label, Description or Alias.

QNumber - Les - "Label in Spanish"
QNumber - Des - "Description in Spanish"
QNumber - Aes - "Alias in Spanish"
QNumber - Lca - "Label in Catalan"
...

all the other Language codes: here.

Recommendable:

  • Create Items always with a Batch input (see the menu) with all the languages you need.
  • Work always in the language of your sources and translate Properties into your language while you are using the software (or when you feel you have nothing better to do - that's the day when you can translate all 600 Properties and their descriptions into Spanish and Catalan). --Olaf Simons (talk) 21:48, 20 May 2021 (CEST)
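A small sketch of how Option 2 above could be scripted for many items at once, assuming a hypothetical translations.csv prepared in Excel or Google Sheets with the columns qnumber, label_es, label_ca, label_pt; the printed lines are pasted into the QuickStatements field as Version 1 input.

import csv

lang_codes = {"Les": "label_es", "Lca": "label_ca", "Lpt": "label_pt"}
with open("translations.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        for code, column in lang_codes.items():
            if row.get(column):
                # QNumber <tab> language code <tab> quoted label, one statement per line
                print("\t".join([row["qnumber"], code, f'"{row[column]}"']))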

Linking pages to sub-pages

I created a new item:

BETA / Bibliografía Española de Textos Antiguos

Now I want to link it to the PhiloBiblon page as "part of".

It's obvious that the first thing I need to do is take the Wikidata tours: https://www.wikidata.org/wiki/Wikidata:Tours

Charles Faulhaber (talk) 21:58, 20 May 2021 (CEST)

If you mean you want to say the item is part of this page - don't. We are not in the Database, just on a Wiki page. But you could create an Item PhiloBiblon and you could then make that statement. --Olaf Simons (talk) 22:16, 20 May 2021 (CEST)

Advice for beginners

I have been doing some little experiments on the FactGrid PhiloBiblon page, with much help from Olaf. We're keeping all of our discussions there. One thing he told me of importance is that you should sign all of your interventions with four tildes, as I am doing here. You will see them at the bottom when you edit this page. When you look at the non-editable page you will see "Charles Faulhaber (talk) 22:07, 20 May 2021 (CEST)". A little learning is a dangerous thing, so I'm going to take some time to look at the various Wikidata:Tours (https://www.wikidata.org/wiki/Wikidata:Tours) before I dive back into this. Charles Faulhaber (talk) 22:10, 20 May 2021 (CEST)