FactGrid:Setup: Difference between revisions

(→‎Query service: federation with WDQS now supported)
 
(9 intermediate revisions by the same user not shown)
Line 1: Line 1:
This page describes the technical setup of the FactGrid website and services.
This page describes the technical setup of the FactGrid website and services.
FactGrid currently runs on a single virtual server, and all the file system paths mentioned here refer to that server.
FactGrid currently runs on a single virtual server, and all the file system paths mentioned here refer to that server.
See also [[/1.39 upgrade]] for a description of the process that was used to upgrade FactGrid from MediaWiki 1.35 to 1.39.


== Database Details ==
== Database Details ==
Line 15: Line 17:


== Packages ==
== Packages ==
To install PHP 7.4 instead of 7.0 (which Debian Stretch ships but MediaWiki 1.35 is no longer compatible with), I used the package archive of the Debian PHP maintainer, following the [https://packages.sury.org/php/README.txt packages.sury.org/php README]. (I confirmed that sury.org belongs to the Debian PHP maintainer by checking the [https://qa.debian.org/developer.php?login=ondrej%40debian.org QA page] linked on the [https://packages.debian.org/stretch/php7.0 Stretch PHP 7.0 package].)


Additional packages installed include:
Additional packages installed include:
Line 25: Line 25:
* php-gmp for MediaWiki (suggested by wikimedia/avro, not sure if needed but can’t hurt)
* php-gmp for MediaWiki (suggested by wikimedia/avro, not sure if needed but can’t hurt)
* php-intl for Unicode support in QuickStatements
* php-intl for Unicode support in QuickStatements
* for building a local Python (for OpenRefine-Wikibase reconciliation service):
* php-curl for Elastica / CirrusSearch
* for building a local Python (for OpenRefine-Wikibase reconciliation service) (with the upgrade to Debian Bullseye, this is probably no longer needed):
** build-essential
** build-essential
** libssl1.0-dev
** libssl1.0-dev
Line 37: Line 38:
== MediaWiki ==
== MediaWiki ==


MediaWiki is installed as a Git clone of the REL1_35 branch under <code>/var/www/w-1.35/</code>, symlinked into <code>/var/www/w/</code>.
MediaWiki is installed as a Git clone of the REL1_39 branch under <code>/var/www/w-1.39/</code>, symlinked into <code>/var/www/w/</code>.
Apache serves <code>/var/www/</code> as document root,
Apache serves <code>/var/www/</code> as document root,
with the standard [[mw:Manual:Short URL/Apache|MediaWiki short URL setup]] to rewrite <code>/wiki/</code> into <code>/w/index.php</code>.
with the standard [[mw:Manual:Short URL/Apache|MediaWiki short URL setup]] to rewrite <code>/wiki/</code> into <code>/w/index.php</code>.
MediaWiki extensions and skins are checked out as Git repositories
MediaWiki extensions and skins are checked out as Git repositories
(some of them are registered as submodules in the REL1_35 branch),
(some of them are registered as submodules in the REL1_39 branch),
but <code>vendor/</code> is installed via Composer,
but <code>vendor/</code> is installed via Composer,
instead of using mediawiki-vendor.
instead of using mediawiki-vendor.
Line 47: Line 48:
Image uploads are enabled (<code>images</code> is owned by <code>www-data:www-data</code>).
Image uploads are enabled (<code>images</code> is owned by <code>www-data:www-data</code>).


The [[mw:Manual:Job queue|job queue]] is processed by the <code>mediawiki-jobqueue.service</code> unit.
The [[mw:Manual:Job queue|job queue]] is processed by the <code>mediawiki-jobqueue.service</code> unit,
which is configured to frequently restart itself,
to avoid having outdated PHP code run for too long as well as out-of-memory errors.
A daily <code>mediawiki-jobqueue-restart.timer</code> additionally restarts the job queue service,
to avoid situations where the job queue fails to start due to database errors and systemd gives up on restarting it forever.


== QuickStatements ==
== QuickStatements ==


The git repositories for [[phabricator:source/tool-quickstatements/|quickstatements]] and its dependency [https://bitbucket.org/magnusmanske/magnustools magnustools] are cloned under <code>/srv/</code>,
The git repositories for [https://github.com/magnusmanske/quickstatements quickstatements] and its dependency [https://bitbucket.org/magnusmanske/magnustools magnustools] are cloned under <code>/srv/</code>,
and symlinks in <code>/var/www/</code> point into their <code>public_html/</code> subdirectories.
and symlinks in <code>/var/www/</code> point into their <code>public_html/</code> subdirectories.
There is an <code>oauth.ini</code> configuration file in <code>/srv/quickstatements/</code>
(The clones were originally named <code>/srv/quickstatements</code> and <code>/srv/magnustools</code>,
but newer versions, cloned under <code>/srv/quickstatements_2023</code> and <code>/srv/magnustools_2023</code>, are used since 26 February 2023.)
There is an <code>oauth.ini</code> configuration file in <code>/srv/quickstatements_2023/</code>
(for [[Special:OAuthListConsumers/view/05133fe786f2fe4d0edbe4490e0a313a|this consumer]],
(for [[Special:OAuthListConsumers/view/05133fe786f2fe4d0edbe4490e0a313a|this consumer]],
with a request modeled after [[d:Special:OAuthListConsumers/view/77b4ae5506dd7dbb0bb07f80e3ae3ca9|the original Wikidata consumer]]),
with a request modeled after [[d:Special:OAuthListConsumers/view/77b4ae5506dd7dbb0bb07f80e3ae3ca9|the original Wikidata consumer]]),
and a <code>config.json</code> file in <code>/src/quickstatements/public_html/</code> describes the URL layout of the FactGrid site
and a <code>config.json</code> file in <code>/src/quickstatements_2023/public_html/</code> describes the URL layout of the FactGrid site
and selects FactGrid as the site to use.
and selects FactGrid as the site to use.
Logs go to <code>/srv/quickstatements/tool.log</code>,
Logs go to <code>/srv/quickstatements_2023/tool.log</code>,
which is owned by the <code>www-data</code> group and group-writable.
which is owned by the <code>www-data</code> group and group-writable.


Batches which the user requests to run in the background,
Batches which the user requests to run in the background,
instead of directly in the browser,
instead of directly in the browser,
are saved to the <code>factgridquickstatements</code> database,
are saved to the <code>quickstatements_2023</code> database,
to which the <code>quickstatements</code> SQL user has access;
to which the <code>quickstatements_2023</code> SQL user has access;
an ugly hack in the <code>openDbTool</code> function in <code>/srv/magnustools/public_html/php/ToolforgeCommon.php</code>
both the <code>openDbTool()</code> calls and <code>setAuthDbName()</code> method in QuickStatements and the <code>openDbTool()</code> function in Magnustools
overrides the normal (very Toolforge-specific) database access code
have been patched to access this database instead of the normal (very Toolforge-specific) database access code,
to instead open that database using the password residing in the <code>/srv/quickstatements/db-password</code> file,
using the password residing in the <code>/srv/quickstatements_2023/db-password</code> file,
which is owned by the <code>www-data</code> group and group- but not world-readable.
which is owned by the <code>www-data</code> group and group- but not world-readable.
QuickStatements has also been patched to format batch links in its edit summaries
QuickStatements has also been patched to format batch links in its edit summaries
Line 108: Line 115:
which we have in <code>/var/lib/wdqs/factgrid.jnl</code>;
which we have in <code>/var/lib/wdqs/factgrid.jnl</code>;
<code>mwservices.conf</code> is edited to add <code>database.factgrid.de</code> to the allowed [[:mw:Wikidata Query Service/User Manual/MWAPI|MWAPI]] endpoints;
<code>mwservices.conf</code> is edited to add <code>database.factgrid.de</code> to the allowed [[:mw:Wikidata Query Service/User Manual/MWAPI|MWAPI]] endpoints;
<code>whitelist.txt</code> is added to allow SPARQL federation with [https://query.wikidata.org/sparql WDQS] (<code>SERVICE <https://query.wikidata.org/sparql> { ... }</code>).
<code>whitelist.txt</code> is added to allow SPARQL federation with the following endpoints:
* [https://query.wikidata.org/sparql WDQS] (<code>SERVICE <https://query.wikidata.org/sparql> { ... }</code>)
* [https://dbpedia.org/sparql DBpedia] (<code>SERVICE <https://dbpedia.org/sparql> { ... }</code>)


The query service itself runs as the <code>blazegraph.service</code> systemd unit
The query service itself runs as the <code>blazegraph.service</code> systemd unit
Line 154: Line 163:


An instance of the [https://github.com/wetneb/openrefine-wikibase openrefine-wikibase] service is installed in <code>/home/factgrid/openrefine-wikibase/</code>,
An instance of the [https://github.com/wetneb/openrefine-wikibase openrefine-wikibase] service is installed in <code>/home/factgrid/openrefine-wikibase/</code>,
using a locally built Python 3.9.9 (sources in <code>/home/factgrid/Python-3.9.9/</code>, installed using <code>make altinstall</code> under prefix <code>/usr/local/</code>),
with dependencies in a venv under <code>.venv/</code> and configuration in <code>config.py</code>.
dependencies in a venv under <code>.venv/</code>,
(Prior to the upgrade to Debian 11 / Bullseye, it used a locally built Python 3.9.9 with sources in <code>/home/factgrid/Python-3.9.9/</code>, installed using <code>make altinstall</code> under prefix <code>/usr/local/</code>;
and configuration in <code>config.py</code>.
this old Python is mostly still around, because Python doesn’t provide a <code>make uninstall</code> command, but it’s no longer used, and I manually renamed the <code>/usr/local/bin</code> files to avoid confusion.
Several [https://gist.github.com/lucaswerkmeister/3ae63110c3869204db1dae26af23814c hacks] are required to make the code run under Python 3.11.)
<code>openrefine-wikibase.service</code> runs the service on localhost, port 8000;
<code>openrefine-wikibase.service</code> runs the service on localhost, port 8000;
Apache is configured to proxy https://database.factgrid.de/reconcile/ to this service,
Apache is configured to proxy https://database.factgrid.de/reconcile/ to this service,
Line 162: Line 172:
or '''https://database.factgrid.de/reconcile/de/api''' for German labels/descriptions.
or '''https://database.factgrid.de/reconcile/de/api''' for German labels/descriptions.
A Wikibase manifest for OpenRefine is available at '''https://database.factgrid.de/factgrid-manifest.json'''.
A Wikibase manifest for OpenRefine is available at '''https://database.factgrid.de/factgrid-manifest.json'''.
== ElasticSearch ==
ElasticSearch is installed via the [https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-oss-7.10.2-amd64.deb 7.10.2 .deb package],
with the <code>org.wikimedia.search:extra:7.10.2-wmf4</code> and <code>org.wikimedia.search.highlighter:experimental-highlighter-elasticsearch-plugin:7.10.2</code> plugins installed via <code>/usr/share/elasticsearch/bin/elasticsearch-plugin install <var>name</var>:<var>version</var></code>.
[[mw:extension:CirrusSearch|CirrusSearch]] and [[mw:extension:WikibaseCirrusSearch|WikibaseCirrusSearch]] are installed, mainly according to the CirrusSearch README;
note that <code>$wgWBCSUseCirrus</code> must already be <code>true</code> when the search index is initialized.
<code>$wgWBRepoSettings['searchIndexTypes']</code> lists the same [[Special:ListDataTypes|property data types]] to index for <code>haswbstatement</code> search as in production:
<code>string</code>, <code>external-id</code>, <code>url</code>, <code>wikibase-item</code>, <code>wikibase-property</code>, <code>wikibase-lexeme</code>, <code>wikibase-form</code>, <code>wikibase-sense</code>.
[[Category:FactGrid Technical]]

Latest revision as of 10:42, 17 August 2024

This page describes the technical setup of the FactGrid website and services. FactGrid currently runs on a single virtual server, and all the file system paths mentioned here refer to that server.

See also /1.39 upgrade for a description of the process that was used to upgrade FactGrid from MediaWiki 1.35 to 1.39.

Database Details

  • CPU: according to /proc/cpuinfo, 4× Intel(R) Xeon(R) CPU E5-2420 v2 @ 2.20GHz
  • RAM: 7.7 GiB or 8.1 GB according to free, 8068724 kB according to /proc/meminfo (plus 7.9 GiB or 8.3 GB swap)
  • free snapshot (low load): 3.3 GiB used, 4.3 GiB buff/cache
  • HD: 976 GiB or 1.1 TB according to df, ext4, on LVM (but as far as I can see only on a single disk, which in turn is virtual according to lsblk, see below); of which used: 133 GiB or 143 GB, i.e. 15% disk usage
  • VM: vmware according to systemd-detect-virt
  • OS: Debian GNU/Linux 9 (stretch) according to /etc/os-release; however, php7.4 instead of php7.0 (from packages.sury.org/php)

This is the system on which both the wiki (web server, PHP) and the query service (Blazegraph plus updater) run (i.e. it has not yet been distributed across multiple systems). Details on the setup follow:

Packages

Additional packages installed include:

  • php-dom for MediaWiki
  • php-mbstring for MediaWiki
  • php-xml for MediaWiki
  • php-gmp for MediaWiki (suggested by wikimedia/avro, not sure if needed but can’t hurt)
  • php-intl for Unicode support in QuickStatements
  • php-curl for Elastica / CirrusSearch
  • for building a local Python (for OpenRefine-Wikibase reconciliation service) (with the upgrade to Debian Bullseye, this is probably no longer needed):
    • build-essential
    • libssl1.0-dev
    • libreadline-dev
    • zlib1g-dev
    • libffi-dev
  • redis-server for OpenRefine-Wikibase reconciliation service

This list is probably incomplete. I hope to add to it in the future if any further packages are installed, but many existing installed packages are not recorded here.

MediaWiki

MediaWiki is installed as a Git clone of the REL1_39 branch under /var/www/w-1.39/, symlinked into /var/www/w/. Apache serves /var/www/ as document root, with the standard MediaWiki short URL setup to rewrite /wiki/ into /w/index.php. MediaWiki extensions and skins are checked out as Git repositories (some of them are registered as submodules in the REL1_39 branch), but vendor/ is installed via Composer, instead of using mediawiki-vendor. (A composer.local.json file instructs Composer to include dependencies of extensions and skins.) Image uploads are enabled (images is owned by www-data:www-data).

The job queue is processed by the mediawiki-jobqueue.service unit, which is configured to frequently restart itself, to avoid having outdated PHP code run for too long as well as out-of-memory errors. A daily mediawiki-jobqueue-restart.timer additionally restarts the job queue service, to avoid situations where the job queue fails to start due to database errors and systemd gives up on restarting it forever.

QuickStatements

The git repositories for quickstatements and its dependency magnustools are cloned under /srv/, and symlinks in /var/www/ point into their public_html/ subdirectories. (The clones were originally named /srv/quickstatements and /srv/magnustools, but newer versions, cloned under /srv/quickstatements_2023 and /srv/magnustools_2023, have been in use since 26 February 2023.) There is an oauth.ini configuration file in /srv/quickstatements_2023/ (for this consumer, with a request modeled after the original Wikidata consumer), and a config.json file in /srv/quickstatements_2023/public_html/ describes the URL layout of the FactGrid site and selects FactGrid as the site to use. Logs go to /srv/quickstatements_2023/tool.log, which is owned by the www-data group and group-writable.
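
The group-writable requirement on tool.log can be checked with a small shell sketch. The helper name is hypothetical (it is not part of QuickStatements), and it assumes GNU stat as shipped with Debian.

```shell
# Hypothetical helper: succeeds if FILE is writable by its owning group.
# Relies on GNU stat (the coreutils version that Debian ships).
is_group_writable() {
    mode=$(stat -c '%A' "$1") || return 1   # e.g. "-rw-rw-r--"
    case "$mode" in
        ?????w????*) return 0 ;;   # 6th character is the group write bit
        *) return 1 ;;
    esac
}

# Example usage on the server (path taken from the text above):
#   stat -c '%U:%G %A' /srv/quickstatements_2023/tool.log
#   is_group_writable /srv/quickstatements_2023/tool.log && echo ok
```

If the check fails, the web server (running as www-data) will most likely be unable to append to the log.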

Batches which the user requests to run in the background, instead of directly in the browser, are saved to the quickstatements_2023 database, to which the quickstatements_2023 SQL user has access. Both the openDbTool() calls and the setAuthDbName() method in QuickStatements, as well as the openDbTool() function in Magnustools, have been patched to bypass the normal (very Toolforge-specific) database access code and open this database instead, using the password residing in the /srv/quickstatements_2023/db-password file, which is owned by the www-data group and is group-readable but not world-readable. QuickStatements has also been patched to format batch links in its edit summaries using the quickstatements: link prefix, instead of the usual toollabs:quickstatements/; the quickstatements: interwiki prefix was installed with the following command (via the maintenance/sql.php script):

INSERT INTO factgridinterwiki (iw_prefix, iw_url, iw_local, iw_trans) VALUES ('quickstatements', '/quickstatements/$1', 1, 0);

The bot which actually processes the batches runs as quickstatements-bot.service, loading batches from the database and sending the appropriate edit requests to the API. (When it has nothing to do, it sleeps in one-second intervals.)

Make sure to run systemctl restart quickstatements-bot whenever code changes to QuickStatements are made, otherwise the bot will not pick them up.

Reasonator

The git repository for reasonator is cloned under /srv, and a symlink in /var/www/ points into its public_html/v2/ subdirectory. config.json is copied from config.json.template with some property IDs replaced with their FactGrid equivalent, a few replaced with “TODO”, and most other property IDs completely removed because they don’t apply to FactGrid. There are also minor uncommitted changes in vue.js (avoid CORS errors) and main-page.html (replace example items), though hopefully those should become unnecessary in the future.

Query service

Upstream instructions:

The query service source is cloned in ~factgrid/wikidata-query-rdf/, built using ant as described in the “getting started” document, and unzipped into /srv/wdqs-0.3.97-SNAPSHOT/ (to which /srv/wdqs/ is a symlink). RWStore.properties is edited to adjust the location of the journal file, which we have in /var/lib/wdqs/factgrid.jnl; mwservices.conf is edited to add database.factgrid.de to the allowed MWAPI endpoints; whitelist.txt is added to allow SPARQL federation with the following endpoints:

  • WDQS (SERVICE <https://query.wikidata.org/sparql> { ... })
  • DBpedia (SERVICE <https://dbpedia.org/sparql> { ... })

The query service itself runs as the blazegraph.service systemd unit (run systemctl cat blazegraph to see the configuration files). Its standard output and error go to the journal, and can be viewed by administrators with journalctl -u blazegraph (add -e for the latest messages).

Apache2 is configured (/etc/apache2/sites-available/001-factgrid-ssl.conf) to forward requests to /sparql to Blazegraph. It adds Blazegraph-specific request headers to enforce a max query time (60 seconds) and read-only mode, and an Access-Control-Allow-Origin response header to allow client-side JavaScript code to read query responses without restrictions.
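
As a rough illustration of how the /sparql endpoint and the federation whitelist fit together, the following sketch prints a federated query and shows (in comments) how it might be sent with curl. The endpoint URL https://database.factgrid.de/sparql is an assumption derived from the host name and path mentioned on this page, the function name is hypothetical, and the query itself is only an arbitrary example.

```shell
# Print a federated SPARQL query that asks Wikidata (through the whitelisted
# WDQS endpoint) for a few instances of "house cat" (Q146). Full IRIs are used
# so the query does not depend on which prefixes the local Blazegraph predefines.
federated_query() {
    cat <<'EOF'
SELECT ?item WHERE {
  SERVICE <https://query.wikidata.org/sparql> {
    ?item <http://www.wikidata.org/prop/direct/P31> <http://www.wikidata.org/entity/Q146> .
  }
}
LIMIT 5
EOF
}

# Assumed public endpoint: https://database.factgrid.de/sparql
# Send the query with curl (not executed here):
#   curl -G 'https://database.factgrid.de/sparql' \
#        -H 'Accept: application/sparql-results+json' \
#        --data-urlencode "query=$(federated_query)"
```

Because Apache enforces the 60-second limit and read-only mode, a query like this can only read data and should be aborted if the remote endpoint is slow.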

The updater for the query service, which reads updates from the wiki’s recent changes and applies them to the query service, similarly runs as blazegraph-update.service.

The query service UI is cloned in ~factgrid/wikidata-query-gui/. It can be built using npm run build, and the resulting build/ directory is then copied into /var/www/, with a symlink /var/www/query pointing to the latest version. A few of the files in the repository have uncommitted changes specific to FactGrid; before updating the GUI, they have to be stashed away.

git stash save &&
git pull &&
git stash pop &&
npm install &&
npm run build &&
cp -a custom-config.json factgrid.png build/ &&
now=$(date -Iseconds) &&
cp -a build/ /var/www/query-"$now" &&
ln -sfT query-"$now" /var/www/query # atomically update symlink
# optional: remove the old /var/www/query-* directory
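
The optional cleanup step could look roughly like the sketch below. The helper name is hypothetical; it assumes that old builds live next to the symlink and follow the query-* naming used above.

```shell
# Hypothetical helper: delete every DIR/query-* directory except the one
# that the DIR/query symlink currently points to.
prune_old_query_builds() {
    dir=$1
    current=$(readlink "$dir/query") || return 1   # relative target, "query-<timestamp>"
    for build in "$dir"/query-*; do
        [ -d "$build" ] || continue                 # nothing matched the pattern
        if [ "$(basename "$build")" != "$(basename "$current")" ]; then
            rm -rf -- "$build"
        fi
    done
}

# Usage on the server (destructive, so review the list of directories first):
#   ls -d /var/www/query-*
#   prune_old_query_builds /var/www
```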

Dumps

dump-json.service creates a gzip-compressed JSON dump in /srv/dumps/, named after the current date (ISO 8601 format). dump-json.timer runs that service each day at 21:00 (CET). /srv/dumps/ is symlinked into /var/www/ (i.e. https://database.factgrid.de/dumps/); systemd-tmpfiles-clean.service, configured via /etc/tmpfiles.d/dumps.conf, removes dumps after 90 days.
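
To preview which dumps are approaching removal, a find-based check can approximate the 90-day rule. This is not how systemd-tmpfiles is configured or invoked; it is only a comparable check by modification time, and the helper name is hypothetical.

```shell
# Hypothetical helper: list regular files directly in DIR whose modification
# time lies more than DAYS whole days in the past. find rounds file ages down
# to whole days, so "+90" matches files that are at least 91 days old.
list_old_dumps() {
    find "$1" -maxdepth 1 -type f -mtime +"$2" | sort
}

# Usage on the server:
#   list_old_dumps /srv/dumps 90
```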

Reconciliation service

An instance of the openrefine-wikibase service is installed in /home/factgrid/openrefine-wikibase/, with dependencies in a venv under .venv/ and configuration in config.py. (Prior to the upgrade to Debian 11 / Bullseye, it used a locally built Python 3.9.9 with sources in /home/factgrid/Python-3.9.9/, installed using make altinstall under prefix /usr/local/; this old Python is mostly still around, because Python doesn’t provide a make uninstall command, but it’s no longer used, and I manually renamed the /usr/local/bin files to avoid confusion. Several hacks are required to make the code run under Python 3.11.) openrefine-wikibase.service runs the service on localhost, port 8000; Apache is configured to proxy https://database.factgrid.de/reconcile/ to this service, which means the actual reconciliation service URL to configure in OpenRefine is https://database.factgrid.de/reconcile/en/api, or https://database.factgrid.de/reconcile/de/api for German labels/descriptions. A Wikibase manifest for OpenRefine is available at https://database.factgrid.de/factgrid-manifest.json.
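
Outside of OpenRefine, the endpoint can also be exercised by hand. The sketch below builds the queries parameter used by the reconciliation API for a single query; the helper name is hypothetical and the search term is an arbitrary example.

```shell
# Build the "queries" parameter for a single reconciliation query.
# The text is inserted verbatim, so it must not contain double quotes or
# backslashes; use a JSON tool such as jq for arbitrary input.
reconcile_payload() {
    printf '{"q0":{"query":"%s"}}' "$1"
}

# Usage (not executed here):
#   curl 'https://database.factgrid.de/reconcile/en/api' \
#        --data-urlencode "queries=$(reconcile_payload 'Gotha')"
```

The response should be a JSON object keyed by q0 with a list of candidate items; swapping en for de in the URL returns German labels and descriptions instead.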

ElasticSearch

ElasticSearch is installed via the 7.10.2 .deb package, with the org.wikimedia.search:extra:7.10.2-wmf4 and org.wikimedia.search.highlighter:experimental-highlighter-elasticsearch-plugin:7.10.2 plugins installed via /usr/share/elasticsearch/bin/elasticsearch-plugin install name:version. CirrusSearch and WikibaseCirrusSearch are installed, mainly according to the CirrusSearch README; note that $wgWBCSUseCirrus must already be true when the search index is initialized. $wgWBRepoSettings['searchIndexTypes'] lists the same property data types to index for haswbstatement search as in production: string, external-id, url, wikibase-item, wikibase-property, wikibase-lexeme, wikibase-form, wikibase-sense.
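
Spelled out for the two plugins named above, the install commands would look as follows. This sketch only prints the commands so they can be reviewed first; the function name is hypothetical.

```shell
# Print (rather than run) the install command for each plugin named above.
es_plugin_commands() {
    for plugin in \
        org.wikimedia.search:extra:7.10.2-wmf4 \
        org.wikimedia.search.highlighter:experimental-highlighter-elasticsearch-plugin:7.10.2
    do
        printf '%s install %s\n' /usr/share/elasticsearch/bin/elasticsearch-plugin "$plugin"
    done
}

es_plugin_commands
```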