FactGrid:Setup
Latest revision as of 20:58, 24 November 2023

This page describes the technical setup of the FactGrid website and services. FactGrid currently runs on a single virtual server, and all the file system paths mentioned here refer to that server.

See also /1.39 upgrade for a description of the process that was used to upgrade FactGrid from MediaWiki 1.35 to 1.39.

Database Details

  • CPU: 4× Intel(R) Xeon(R) CPU E5-2420 v2 @ 2.20GHz according to /proc/cpuinfo
  • RAM: 7.7 GiB (8.1 GB) according to free, 8068724 kB according to /proc/meminfo (plus 7.9 GiB, or 8.3 GB, of swap)
  • free snapshot (low load): 3.3 GiB used, 4.3 buff/cache
  • HD: 976 GiB (1.1 TB) according to df, ext4, on LVM (but as far as I can see on a single disk only, which in turn is virtual according to lsblk; see below); of this, 133 GiB (143 GB) are used, i.e. 15% disk usage
  • VM: vmware according to systemd-detect-virt
  • OS: Debian GNU/Linux 9 (stretch) according to /etc/os-release; but with php7.4 instead of php7.0 (from packages.sury.org/php)

This is the system on which both the wiki (web server, PHP) and the query service (Blazegraph plus updater) run (i.e. they have not been distributed across multiple systems so far). Details of the setup follow.
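
The figures above can be collected again with standard tools. The following is a minimal sketch, assuming a Linux system with the usual procps and coreutils commands; the function name print_system_summary is made up for this illustration and is not part of the FactGrid setup.

```shell
#!/bin/sh
# print_system_summary is a hypothetical helper name (an assumption of this sketch).
# It prints the facts listed above, each from the source named in the list.
print_system_summary() {
    # CPU model lines from /proc/cpuinfo, with a count per model
    grep 'model name' /proc/cpuinfo | sort | uniq -c
    # RAM and swap in human-readable units
    free -h
    # Usage of the root file system, and the block device layout
    df -h /
    lsblk 2>/dev/null || echo "lsblk not available"
    # Virtualization type; the tool ships with systemd and may be missing elsewhere
    if command -v systemd-detect-virt >/dev/null 2>&1; then
        systemd-detect-virt || true
    fi
    # Distribution name and version, read in a subshell to keep variables local
    ( . /etc/os-release && echo "$PRETTY_NAME" )
}

print_system_summary
```

Running it again after hardware or OS changes gives a quick way to check whether the list above is still current.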

Packages

To install PHP 7.4 instead of 7.0 (which Debian Stretch ships but MediaWiki 1.35 is no longer compatible with), I used the package archive of the Debian PHP maintainer, following the packages.sury.org/php README. (I confirmed that sury.org belongs to the Debian PHP maintainer by checking the QA page linked on the Stretch PHP 7.0 package.)

Additional packages installed include:

  • php-dom for MediaWiki
  • php-mbstring for MediaWiki
  • php-xml for MediaWiki
  • php-gmp for MediaWiki (suggested by wikimedia/avro, not sure if needed but can’t hurt)
  • php-intl for Unicode support in QuickStatements
  • for building a local Python (for OpenRefine-Wikibase reconciliation service):
    • build-essential
    • libssl1.0-dev
    • libreadline-dev
    • zlib1g-dev
    • libffi-dev
  • redis-server for OpenRefine-Wikibase reconciliation service

This list is probably incomplete. I hope to add to it in the future if any further packages are installed, but many existing installed packages are not recorded here.

MediaWiki

MediaWiki is installed as a Git clone of the REL1_39 branch under /var/www/w-1.39/, symlinked into /var/www/w/. Apache serves /var/www/ as document root, with the standard MediaWiki short URL setup to rewrite /wiki/ into /w/index.php. MediaWiki extensions and skins are checked out as Git repositories (some of them are registered as submodules in the REL1_39 branch), but vendor/ is installed via Composer, instead of using mediawiki-vendor. (A composer.local.json file instructs Composer to include dependencies of extensions and skins.) Image uploads are enabled (images is owned by www-data:www-data).

The job queue is processed by the mediawiki-jobqueue.service unit, which is configured to restart itself frequently, so that outdated PHP code does not keep running for long and out-of-memory errors are avoided.

QuickStatements

The git repositories for quickstatements and its dependency magnustools are cloned under /srv/, and symlinks in /var/www/ point into their public_html/ subdirectories. (The clones were originally named /srv/quickstatements and /srv/magnustools, but newer versions, cloned under /srv/quickstatements_2023 and /srv/magnustools_2023, have been in use since 26 February 2023.) There is an oauth.ini configuration file in /srv/quickstatements_2023/ (for this consumer, with a request modeled after the original Wikidata consumer), and a config.json file in /srv/quickstatements_2023/public_html/ describes the URL layout of the FactGrid site and selects FactGrid as the site to use. Logs go to /srv/quickstatements_2023/tool.log, which is owned by the www-data group and group-writable.

Batches which the user requests to run in the background, instead of directly in the browser, are saved to the quickstatements_2023 database, to which the quickstatements_2023 SQL user has access; both the openDbTool() calls and setAuthDbName() method in QuickStatements and the openDbTool() function in Magnustools have been patched to access this database instead of the normal (very Toolforge-specific) database access code, using the password residing in the /srv/quickstatements_2023/db-password file, which is owned by the www-data group and group- but not world-readable. QuickStatements has also been patched to format batch links in its edit summaries using the quickstatements: link prefix, instead of the usual toollabs:quickstatements/; the quickstatements: interwiki prefix was installed with the following command (via the maintenance/sql.php script):

INSERT INTO factgridinterwiki (iw_prefix, iw_url, iw_local, iw_trans) VALUES ('quickstatements', '/quickstatements/$1', 1, 0);
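
The file permissions described above for tool.log (group-writable) and db-password (group-readable but not world-readable) can be illustrated as follows. This is a sketch on scratch files in a temporary directory, not the commands that were actually run; the exact modes 664 and 640 are assumptions that merely satisfy the description, and on the server the files would additionally be assigned to the www-data group with chgrp.

```shell
#!/bin/sh
# Demonstrates the permission bits on scratch files. On the server the same
# modes would apply to /srv/quickstatements_2023/tool.log and db-password,
# together with "chgrp www-data <file>" (which needs suitable privileges).
demo_dir=$(mktemp -d)
touch "$demo_dir/tool.log" "$demo_dir/db-password"

chmod 664 "$demo_dir/tool.log"      # group-writable log file (assumed mode)
chmod 640 "$demo_dir/db-password"   # group-readable, not world-readable (assumed mode)

# Read the resulting modes back (stat -c is the GNU coreutils syntax)
log_mode=$(stat -c %a "$demo_dir/tool.log")
password_mode=$(stat -c %a "$demo_dir/db-password")
echo "tool.log: $log_mode, db-password: $password_mode"
```

With these modes, PHP running as www-data can append to the log and read the password, while other local users cannot read the password file.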

The bot which actually processes the batches runs as quickstatements-bot.service, loading batches from the database and sending the appropriate edit requests to the API. (When it has nothing to do, it sleeps in one-second intervals.)

Make sure to run systemctl restart quickstatements-bot whenever code changes to QuickStatements are made, otherwise the bot will not pick them up.

Reasonator

The git repository for reasonator is cloned under /srv, and a symlink in /var/www/ points into its public_html/v2/ subdirectory. config.json is copied from config.json.template with some property IDs replaced with their FactGrid equivalent, a few replaced with “TODO”, and most other property IDs completely removed because they don’t apply to FactGrid. There are also minor uncommitted changes in vue.js (avoid CORS errors) and main-page.html (replace example items), though hopefully those should become unnecessary in the future.

Query service

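Upstream instructions:

  • getting started (https://github.com/wikimedia/wikidata-query-rdf/blob/master/docs/getting-started.md)
  • standalone setup (mw:Wikidata Query Service/Implementation/Standalone)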

The query service source is cloned in ~factgrid/wikidata-query-rdf/, built using ant as described in the “getting started” document, and unzipped into /srv/wdqs-0.3.97-SNAPSHOT/ (to which /srv/wdqs/ is a symlink). RWStore.properties is edited to adjust the location of the journal file, which we have in /var/lib/wdqs/factgrid.jnl; mwservices.conf is edited to add database.factgrid.de to the allowed MWAPI endpoints; whitelist.txt is added to allow SPARQL federation with the following endpoints:

  • WDQS (SERVICE <https://query.wikidata.org/sparql> { ... })
  • DBpedia (SERVICE <https://dbpedia.org/sparql> { ... })

The query service itself runs as the blazegraph.service systemd unit (run systemctl cat blazegraph to see the configuration files). Its standard output and error go to the journal, and can be viewed by administrators with journalctl -u blazegraph (add -e for the latest messages).

Apache2 is configured (/etc/apache2/sites-available/001-factgrid-ssl.conf) to forward requests to /sparql to Blazegraph. It adds Blazegraph-specific request headers to enforce a max query time (60 seconds) and read-only mode, and an Access-Control-Allow-Origin response header to allow client-side JavaScript code to read query responses without restrictions.
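
From a client's point of view, the forwarding described above means that queries can be sent to the /sparql path over HTTPS. The following is a minimal sketch using curl, assuming the endpoint is reachable at the database.factgrid.de host name used elsewhere on this page; the helper name run_sparql is hypothetical.

```shell
#!/bin/sh
# Endpoint path taken from the Apache forwarding described above; the host name
# is the one used elsewhere on this page (an assumption for this sketch).
sparql_endpoint="https://database.factgrid.de/sparql"

# run_sparql is a hypothetical helper: sends one query as a GET request
# and prints the result in the SPARQL JSON results format.
run_sparql() {
    curl -sS -G "$sparql_endpoint" \
        -H 'Accept: application/sparql-results+json' \
        --data-urlencode "query=$1"
}

# Example call (needs network access):
# run_sparql 'SELECT * WHERE { ?s ?p ?o } LIMIT 3'
```

Because of the query time limit and read-only mode that Apache enforces, long-running queries and SPARQL updates sent this way should be expected to fail.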

The updater for the query service, which reads updates from the wiki’s recent changes and applies them to the query service, similarly runs as blazegraph-update.service.

The query service UI is cloned in ~factgrid/wikidata-query-gui/. It can be built using npm run build, and the resulting build/ directory is then copied into /var/www/, with a symlink /var/www/query pointing to the latest version. A few of the files in the repository have uncommitted changes specific to FactGrid; before updating the GUI, they have to be stashed away.

git stash save &&
git pull &&
git stash pop &&
npm install &&
npm run build &&
cp -a custom-config.json factgrid.png build/ &&
now=$(date -Iseconds) &&
cp -a build/ /var/www/query-"$now" &&
ln -sfT query-"$now" /var/www/query # atomically update symlink
# optional: remove the old /var/www/query-* directory
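
For the optional cleanup step at the end of the commands above, it helps to know which query-* directories the symlink no longer points to. The following sketch lists them without deleting anything; stale_query_dirs is a hypothetical helper name, and the demonstration runs in a temporary directory rather than in /var/www/.

```shell
#!/bin/sh
# stale_query_dirs is a hypothetical helper: it prints every query-* directory
# under the given root except the one the "query" symlink currently points to.
stale_query_dirs() {
    root=$1
    current=$(readlink "$root/query")   # relative target, e.g. query-<timestamp>
    for dir in "$root"/query-*; do
        [ -d "$dir" ] || continue
        [ "$(basename "$dir")" = "$current" ] && continue
        printf '%s\n' "$dir"
    done
}

# Demonstration in a scratch directory instead of /var/www/
demo=$(mktemp -d)
mkdir "$demo/query-old" "$demo/query-new"
ln -sfT query-new "$demo/query"
stale_query_dirs "$demo"   # prints only the path of query-old
```

On the server, the output of stale_query_dirs /var/www could be reviewed and then passed to rm -r by hand.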

Dumps

dump-json.service creates a gzip-compressed JSON dump in /srv/dumps/, named after the current date (ISO 8601 format). dump-json.timer runs that service each day at 21:00 (CET). /srv/dumps/ is symlinked into /var/www/ (i.e. https://database.factgrid.de/dumps/); systemd-tmpfiles-clean.service, configured via /etc/tmpfiles.d/dumps.conf, removes dumps after 90 days.
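
The date-based naming can be sketched as follows. The actual command run by dump-json.service is not reproduced on this page, so the helper name dump_file_name, the .json.gz suffix and the commented-out pipeline (which uses Wikibase's dumpJson.php maintenance script and the MediaWiki path from above) are assumptions.

```shell
#!/bin/sh
# dump_file_name is a hypothetical helper; the ".json.gz" suffix is an assumption.
dump_file_name() {
    # date -I prints the current date in ISO 8601 format (YYYY-MM-DD)
    printf '%s.json.gz\n' "$(date -I)"
}

# What the service might run (not executed here; paths are assumptions):
# php /var/www/w/extensions/Wikibase/repo/maintenance/dumpJson.php \
#     | gzip > "/srv/dumps/$(dump_file_name)"

dump_file_name
```

The timer itself can be inspected with systemctl list-timers dump-json.timer, which shows the next scheduled run.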

Reconciliation service

An instance of the openrefine-wikibase service is installed in /home/factgrid/openrefine-wikibase/, using a locally built Python 3.9.9 (sources in /home/factgrid/Python-3.9.9/, installed using make altinstall under prefix /usr/local/), dependencies in a venv under .venv/, and configuration in config.py. openrefine-wikibase.service runs the service on localhost, port 8000; Apache is configured to proxy https://database.factgrid.de/reconcile/ to this service, which means the actual reconciliation service URL to configure in OpenRefine is https://database.factgrid.de/reconcile/en/api, or https://database.factgrid.de/reconcile/de/api for German labels/descriptions. A Wikibase manifest for OpenRefine is available at https://database.factgrid.de/factgrid-manifest.json.
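
The reconciliation endpoints above can also be queried directly, which is useful for checking that the proxy and the service are up. The following is a minimal sketch with curl; the helper names reconcile_url and reconcile are hypothetical, and the "queries" form parameter is the one defined by the reconciliation API that OpenRefine uses.

```shell
#!/bin/sh
# reconcile_url is a hypothetical helper that builds the per-language API URL.
reconcile_url() {
    printf 'https://database.factgrid.de/reconcile/%s/api\n' "$1"
}

# Hypothetical helper: send a single reconciliation query for a search string.
# In this simple sketch the search string must not contain double quotes,
# because it is pasted into the JSON body without escaping.
reconcile() {
    curl -sS "$(reconcile_url "$1")" \
        --data-urlencode "queries={\"q0\":{\"query\":\"$2\"}}"
}

reconcile_url de
# Example call (needs network access):
# reconcile en 'Illuminati'
```

A plain GET request to the same URL should return the service manifest rather than query results.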

ElasticSearch

ElasticSearch is installed via the 7.10.2 .deb package, with the org.wikimedia.search:extra:7.10.2-wmf4 and org.wikimedia.search.highlighter:experimental-highlighter-elasticsearch-plugin:7.10.2 plugins installed via /usr/share/elasticsearch/bin/elasticsearch-plugin install name:version. CirrusSearch and WikibaseCirrusSearch are installed, mainly according to the CirrusSearch README; note that $wgWBCSUseCirrus must already be true when the search index is initialized. $wgWBRepoSettings['searchIndexTypes'] lists the same property data types to index for haswbstatement search as in production: string, external-id, url, wikibase-item, wikibase-property, wikibase-lexeme, wikibase-form, wikibase-sense.
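
The installation steps above can be sketched as two shell functions. This is an illustration under stated assumptions, not a record of the commands actually run: the function names are hypothetical, the MediaWiki path is the one given earlier on this page, and the indexing commands are the ones listed in the CirrusSearch README, which also describes configuration changes to make around them.

```shell
#!/bin/sh
# Paths assume the MediaWiki checkout described earlier on this page.
mw=/var/www/w
es_plugin=/usr/share/elasticsearch/bin/elasticsearch-plugin

# Hypothetical helper: install the two plugins named above (run as root).
install_search_plugins() {
    "$es_plugin" install org.wikimedia.search:extra:7.10.2-wmf4 &&
    "$es_plugin" install org.wikimedia.search.highlighter:experimental-highlighter-elasticsearch-plugin:7.10.2
}

# Hypothetical helper: create and fill the index, following the CirrusSearch README.
# $wgWBCSUseCirrus must already be true in LocalSettings.php before this runs.
initialize_search_index() {
    php "$mw/extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php" &&
    php "$mw/extensions/CirrusSearch/maintenance/ForceSearchIndex.php" --skipLinks --indexOnSkip &&
    php "$mw/extensions/CirrusSearch/maintenance/ForceSearchIndex.php" --skipParse
}

# Nothing is executed on load; call the functions explicitly when needed:
# install_search_plugins && initialize_search_index
```

Keeping the two steps separate makes it possible to rebuild the index later without touching the installed plugins.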