FactGrid:Setup: Difference between revisions

Revision as of 13:36, 19 December 2021

This page describes the technical setup of the FactGrid website and services. FactGrid currently runs on a single virtual server, and all the file system paths mentioned here refer to that server.

Database Details

  • CPU: 4× Intel(R) Xeon(R) CPU E5-2420 v2 @ 2.20GHz, according to /proc/cpuinfo
  • RAM: 7.7 GiB (8.1 GB) according to free, 8068724 kB according to /proc/meminfo (plus 7.9 GiB, or 8.3 GB, of swap)
  • free snapshot (low load): 3.3 GiB used, 4.3 GiB buff/cache
  • HD: 976 GiB (1.1 TB) according to df, ext4, on LVM (though as far as I can see only on a single disk, which in turn is virtual according to lsblk; see below); of that, 133 GiB (143 GB) is used, i.e. 15% disk usage
  • VM: vmware according to systemd-detect-virt
  • OS: Debian GNU/Linux 9 (stretch) according to /etc/os-release; however, php7.4 instead of php7.0 (from packages.sury.org/php)

This is the system on which both the wiki (web server, PHP) and the query service (Blazegraph plus updater) run, i.e. the setup has not yet been distributed across multiple systems. Details of the setup follow:

Packages

To install PHP 7.4 instead of 7.0 (which Debian Stretch ships but MediaWiki 1.35 is no longer compatible with), I used the package archive of the Debian PHP maintainer, following the packages.sury.org/php README. (I confirmed that sury.org belongs to the Debian PHP maintainer by checking the QA page linked on the Stretch PHP 7.0 package.)

Additional packages installed include:

  • php-dom for MediaWiki
  • php-mbstring for MediaWiki
  • php-xml for MediaWiki
  • php-gmp for MediaWiki (suggested by wikimedia/avro, not sure if needed but can’t hurt)
  • php-intl for Unicode support in QuickStatements
  • for building a local Python (for OpenRefine-Wikibase reconciliation service):
    • build-essential
    • libssl1.0-dev
    • libreadline-dev
    • zlib1g-dev
    • libffi-dev
  • redis-server for OpenRefine-Wikibase reconciliation service

This list is probably incomplete. I hope to add to it in the future if any further packages are installed, but many existing installed packages are not recorded here.

MediaWiki

MediaWiki is installed as a Git clone of the REL1_35 branch under /var/www/w-1.35/, symlinked into /var/www/w/. Apache serves /var/www/ as document root, with the standard MediaWiki short URL setup to rewrite /wiki/ into /w/index.php. MediaWiki extensions and skins are checked out as Git repositories (some of them are registered as submodules in the REL1_35 branch), but vendor/ is installed via Composer rather than from mediawiki-vendor. (A composer.local.json file instructs Composer to include dependencies of extensions and skins.) Image uploads are enabled (the images/ directory is owned by www-data:www-data).

The job queue is processed by the mediawiki-jobqueue.service unit.

QuickStatements

The git repositories for quickstatements and its dependency magnustools are cloned under /srv/, and symlinks in /var/www/ point into their public_html/ subdirectories. There is an oauth.ini configuration file in /srv/quickstatements/ (for this consumer, with a request modeled after the original Wikidata consumer), and a config.json file in /srv/quickstatements/public_html/ describes the URL layout of the FactGrid site and selects FactGrid as the site to use. Logs go to /srv/quickstatements/tool.log, which is owned by the www-data group and group-writable.

Batches which the user requests to run in the background, instead of directly in the browser, are saved to the factgridquickstatements database, to which the quickstatements SQL user has access. An ugly hack in the openDbTool function in /srv/magnustools/public_html/php/ToolforgeCommon.php overrides the normal (very Toolforge-specific) database access code so that it instead opens that database using the password stored in the /srv/quickstatements/db-password file, which is owned by the www-data group and is group-readable but not world-readable. QuickStatements has also been patched to format batch links in its edit summaries using the quickstatements: link prefix, instead of the usual toollabs:quickstatements/; the quickstatements: interwiki prefix was installed with the following command (via the maintenance/sql.php script):

INSERT INTO factgridinterwiki (iw_prefix, iw_url, iw_local, iw_trans) VALUES ('quickstatements', '/quickstatements/$1', 1, 0);

The bot which actually processes the batches runs as quickstatements-bot.service, loading batches from the database and sending the appropriate edit requests to the API. (When it has nothing to do, it sleeps in one-second intervals.)

Make sure to run systemctl restart quickstatements-bot whenever code changes to QuickStatements are made, otherwise the bot will not pick them up.

Reasonator

The git repository for reasonator is cloned under /srv, and a symlink in /var/www/ points into its public_html/v2/ subdirectory. config.json is copied from config.json.template with some property IDs replaced with their FactGrid equivalent, a few replaced with “TODO”, and most other property IDs completely removed because they don’t apply to FactGrid. There are also minor uncommitted changes in vue.js (avoid CORS errors) and main-page.html (replace example items), though hopefully those should become unnecessary in the future.

Query service

Upstream instructions:

The query service source is cloned in ~factgrid/wikidata-query-rdf/, built using ant as described in the “getting started” document, and unzipped into /srv/wdqs-0.3.5-SNAPSHOT/ (to which /srv/wdqs/ is a symlink). RWStore.properties is edited to adjust the location of the journal file, which we have in /var/lib/wdqs/factgrid.jnl. There is also a nearly-empty mwservices.conf file ({"services":{},"endpoints":[]}) to avoid a warning if that file is missing completely.

The query service itself runs as the blazegraph.service systemd unit (run systemctl cat blazegraph to see the configuration files). Its standard output and error go to the journal, and can be viewed by administrators with journalctl -u blazegraph (add -e for the latest messages).

Apache2 is configured (/etc/apache2/sites-available/001-factgrid-ssl.conf) to forward requests to /sparql to Blazegraph. It adds Blazegraph-specific request headers to enforce a max query time (60 seconds) and read-only mode, and an Access-Control-Allow-Origin response header to allow client-side JavaScript code to read query responses without restrictions.

The updater for the query service, which reads updates from the wiki’s recent changes and applies them to the query service, similarly runs as blazegraph-update.service.
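
Several long-running systemd units have been named on this page by now, so a quick status overview can be handy after a reboot or an upgrade. The following is only a sketch: the function name report_units and the SYSTEMCTL override variable are made up for illustration (the override exists solely so the function can be tried without systemd), and the unit list is copied from the text above.

```shell
#!/bin/sh
# Units named on this page so far (copied from the text, not from the server).
FACTGRID_UNITS="mediawiki-jobqueue.service quickstatements-bot.service blazegraph.service blazegraph-update.service"

# Print "<unit><TAB><state>" for each unit given as an argument.
# SYSTEMCTL is a hypothetical override hook, not part of the FactGrid setup.
report_units() {
    for unit in "$@"; do
        # "is-active" exits non-zero for inactive or failed units,
        # so the exit status is deliberately ignored here.
        state=$("${SYSTEMCTL:-systemctl}" is-active "$unit" 2>/dev/null) || true
        printf '%s\t%s\n' "$unit" "${state:-unknown}"
    done
}
```

On the server this would be called as report_units $FACTGRID_UNITS (unquoted on purpose, so that the list is split into separate arguments); anything other than "active" is a hint to look at journalctl -u for that unit.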

The query service UI is cloned in ~factgrid/wikidata-query-rdf/gui/. It can be built using npm run build, and the resulting build/ directory is then copied into /var/www/, with a symlink /var/www/query pointing to the latest version. A few of the files in the repository have uncommitted changes specific to FactGrid; before updating the GUI, they have to be stashed away.

git stash save &&
git pull &&
git stash pop &&
npm install &&
npm run build &&
cp -a custom-config.json build/ &&
now=$(date -Iseconds) &&
cp -a build/ /var/www/query-"$now" &&
ln -sfT query-"$now" /var/www/query # atomically update symlink
# optional: remove the old /var/www/query-* directory
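
The optional cleanup step at the end could be scripted roughly as follows. This is a sketch under assumptions rather than something that is installed on the server: the function name prune_old_query_builds is hypothetical, and it assumes that every directory matching query-* under the web root is a GUI build that may be deleted once the symlink no longer points to it.

```shell
# Remove every query-* build directory under the given web root except the
# one that the "query" symlink currently points to.
prune_old_query_builds() {
    webroot=$1
    # Refuse to do anything unless the symlink exists, so that a broken
    # deployment never results in deleting all builds.
    [ -L "$webroot/query" ] || return 1
    current=$(basename "$(readlink "$webroot/query")")
    for dir in "$webroot"/query-*; do
        [ -d "$dir" ] || continue
        # Keep the build that is currently being served.
        [ "$(basename "$dir")" = "$current" ] && continue
        rm -rf -- "$dir"
    done
}
```

Usage would be prune_old_query_builds /var/www, run only after checking that the new build works, since the previous build is the easiest rollback target.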

Dumps

dump-json.service creates a gzip-compressed JSON dump in /srv/dumps/, named after the current date (ISO 8601 format). dump-json.timer runs that service each day at 21:00 (CET). /srv/dumps/ is symlinked into /var/www/; systemd-tmpfiles-clean.service, configured via /etc/tmpfiles.d/dumps.conf, removes dumps after 90 days.
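
To double-check that the 90-day cleanup keeps working, one could list any dumps that should already have been removed. The sketch below is an approximation only: the function name list_stale_dumps is hypothetical, and it looks at the modification time alone, whereas systemd-tmpfiles also takes access and change times into account, so the two can disagree for files that were read or touched recently.

```shell
# List regular files directly inside a dump directory whose modification time
# lies more than the given number of days in the past (default: 90).
list_stale_dumps() {
    dump_dir=$1
    max_age_days=${2:-90}
    # -mtime +N counts whole 24-hour periods, so the cut-off is approximate.
    find "$dump_dir" -maxdepth 1 -type f -mtime +"$max_age_days" | sort
}
```

Running list_stale_dumps /srv/dumps should print nothing as long as the cleanup does its job; any output names files that are older than expected.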

Reconciliation service

An instance of the openrefine-wikibase service is installed in /home/factgrid/openrefine-wikibase/, using a locally built Python 3.9.9 (sources in /home/factgrid/Python-3.9.9/, installed using make altinstall under prefix /usr/local/), dependencies in a venv under .venv/, and configuration in config.py. openrefine-wikibase.service runs the service on localhost, port 8000; Apache is configured to proxy https://database.factgrid.de/reconcile/ to this service, which means the actual reconciliation service URL to configure in OpenRefine is https://database.factgrid.de/reconcile/en/api, or https://database.factgrid.de/reconcile/de/api for German labels/descriptions. A Wikibase manifest for OpenRefine is available at https://database.factgrid.de/factgrid-manifest.json.
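
For a quick test of the service without OpenRefine, the endpoint can also be queried from the command line. The following sketch assumes that the service accepts the usual reconciliation API request format (a form-encoded queries parameter holding a JSON object of named queries); the function names reconcile_endpoint and reconcile_query are made up for illustration, and the query text is pasted into the JSON without escaping.

```shell
# Build the reconciliation endpoint URL for a label language
# ("en" and "de" are the two mentioned above).
reconcile_endpoint() {
    printf 'https://database.factgrid.de/reconcile/%s/api\n' "$1"
}

# Send a single reconciliation query and print the raw JSON response.
# Assumption: the query text contains no double quotes or backslashes,
# because it is inserted into the JSON payload without any escaping.
reconcile_query() {
    lang=$1
    text=$2
    curl -sS --data-urlencode "queries={\"q0\":{\"query\":\"$text\"}}" \
        "$(reconcile_endpoint "$lang")"
}
```

For example, reconcile_query de 'Gotha' would ask for candidates with German labels; the response format is not shown here because it depends on the running service.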