FactGrid:Setup
Latest revision as of 10:42, 17 August 2024
This page describes the technical setup of the FactGrid website and services. FactGrid currently runs on a single virtual server, and all the file system paths mentioned here refer to that server.
See also /1.39 upgrade for a description of the process that was used to upgrade FactGrid from MediaWiki 1.35 to 1.39.
Database Details
- CPU: according to /proc/cpuinfo, 4× Intel(R) Xeon(R) CPU E5-2420 v2 @ 2.20GHz
- RAM: 7.7 GiB or 8.1 GB according to free, 8068724 kB according to /proc/meminfo (plus 7.9 GiB or 8.3 GB of swap)
- free snapshot (low load): 3.3 GiB used, 4.3 GiB buff/cache
- HD: 976 GiB or 1.1 TB according to df, ext4, on LVM (but as far as I can see only on a single disk, which in turn is virtual according to lsblk, see below); of which used: 133 GiB or 143 GB, i.e. 15% disk usage
- VM: vmware according to systemd-detect-virt
- OS: Debian GNU/Linux 9 (stretch) according to /etc/os-release; however, php7.4 instead of php7.0 (from packages.sury.org/php)
This is the system on which both the wiki (web server, PHP) and the query service (Blazegraph plus updater) run (i.e. it has not yet been distributed across multiple systems). Details of the setup follow:
Packages
Additional packages installed include:
- php-dom for MediaWiki
- php-mbstring for MediaWiki
- php-xml for MediaWiki
- php-gmp for MediaWiki (suggested by wikimedia/avro, not sure if needed but can’t hurt)
- php-intl for Unicode support in QuickStatements
- php-curl for Elastica / CirrusSearch
- for building a local Python (for OpenRefine-Wikibase reconciliation service) (with the upgrade to Debian Bullseye, this is probably no longer needed):
  - build-essential
  - libssl1.0-dev
  - libreadline-dev
  - zlib1g-dev
  - libffi-dev
- redis-server for OpenRefine-Wikibase reconciliation service
This list is probably incomplete. I hope to add to it in the future if any further packages are installed, but many existing installed packages are not recorded here.
MediaWiki
MediaWiki is installed as a Git clone of the REL1_39 branch under /var/www/w-1.39/, symlinked into /var/www/w/. Apache serves /var/www/ as document root, with the standard MediaWiki short URL setup to rewrite /wiki/ into /w/index.php. MediaWiki extensions and skins are checked out as Git repositories (some of them are registered as submodules in the REL1_39 branch), but vendor/ is installed via Composer, instead of using mediawiki-vendor. (A composer.local.json file instructs Composer to include dependencies of extensions and skins.)
Image uploads are enabled (images is owned by www-data:www-data).
The job queue is processed by the mediawiki-jobqueue.service unit, which is configured to frequently restart itself, to avoid having outdated PHP code run for too long as well as out-of-memory errors. A daily mediawiki-jobqueue-restart.timer additionally restarts the job queue service, to avoid situations where the job queue fails to start due to database errors and systemd gives up on restarting it forever.
QuickStatements
The git repositories for quickstatements and its dependency magnustools are cloned under /srv/, and symlinks in /var/www/ point into their public_html/ subdirectories. (The clones were originally named /srv/quickstatements and /srv/magnustools, but newer versions, cloned under /srv/quickstatements_2023 and /srv/magnustools_2023, have been used since 26 February 2023.)
There is an oauth.ini configuration file in /srv/quickstatements_2023/ (for the OAuth consumer listed at Special:OAuthListConsumers/view/05133fe786f2fe4d0edbe4490e0a313a, with a request modeled after the original Wikidata consumer), and a config.json file in /srv/quickstatements_2023/public_html/ describes the URL layout of the FactGrid site and selects FactGrid as the site to use.
Logs go to /srv/quickstatements_2023/tool.log, which is owned by the www-data group and group-writable.
Batches which the user requests to run in the background, instead of directly in the browser, are saved to the quickstatements_2023 database, to which the quickstatements_2023 SQL user has access; both the openDbTool() calls and setAuthDbName() method in QuickStatements and the openDbTool() function in Magnustools have been patched to access this database instead of the normal (very Toolforge-specific) database access code, using the password residing in the /srv/quickstatements_2023/db-password file, which is owned by the www-data group and group- but not world-readable.
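The group ownership and modes described for tool.log and db-password could be set up roughly as follows. This is a minimal sketch, not the commands actually used on the server: the helper name restrict_to_group is hypothetical, the exact mode of tool.log (664) is an assumption since the text only says it is group-writable, and the demo uses a scratch directory with the current user's group so that it runs without root. On the server, the group would be www-data and the directory /srv/quickstatements_2023/.

```shell
#!/bin/sh
# Hypothetical helper: hand TARGET to OWNER_GROUP and set its mode to PERMS.
restrict_to_group() {
    target=$1 owner_group=$2 perms=$3
    chgrp "$owner_group" "$target" &&
    chmod "$perms" "$target"
}

# Demo in a scratch directory (assumption: stands in for /srv/quickstatements_2023/).
demo_dir=$(mktemp -d)
demo_group=$(id -gn)   # on the server this would be www-data
touch "$demo_dir/tool.log" "$demo_dir/db-password"

# Log file: group-writable (the "other" bits are an assumption).
restrict_to_group "$demo_dir/tool.log" "$demo_group" 664
# Password file: group-readable, but not world-readable.
restrict_to_group "$demo_dir/db-password" "$demo_group" 640

# Show the resulting octal modes (GNU stat, as found on Debian).
stat -c '%a %n' "$demo_dir/tool.log" "$demo_dir/db-password"
```

With mode 640, processes running as a member of the group can read the password, while other local users cannot.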
QuickStatements has also been patched to format batch links in its edit summaries using the quickstatements: link prefix, instead of the usual toollabs:quickstatements/; the quickstatements: interwiki prefix was installed with the following command (via the maintenance/sql.php script):
INSERT INTO factgridinterwiki (iw_prefix, iw_url, iw_local, iw_trans) VALUES ('quickstatements', '/quickstatements/$1', 1, 0);
The bot which actually processes the batches runs as quickstatements-bot.service, loading batches from the database and sending the appropriate edit requests to the API. (When it has nothing to do, it sleeps in one-second intervals.) Make sure to run systemctl restart quickstatements-bot whenever code changes to QuickStatements are made, otherwise the bot will not pick them up.
Reasonator
The git repository for reasonator is cloned under /srv, and a symlink in /var/www/ points into its public_html/v2/ subdirectory. config.json is copied from config.json.template with some property IDs replaced with their FactGrid equivalent, a few replaced with “TODO”, and most other property IDs completely removed because they don’t apply to FactGrid. There are also minor uncommitted changes in vue.js (avoid CORS errors) and main-page.html (replace example items), though hopefully those should become unnecessary in the future.
Query service
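Upstream instructions:
- getting started (https://github.com/wikimedia/wikidata-query-rdf/blob/master/docs/getting-started.md)
- standalone setup (mw:Wikidata Query Service/Implementation/Standalone)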
The query service source is cloned in ~factgrid/wikidata-query-rdf/, built using ant as described in the “getting started” document, and unzipped into /srv/wdqs-0.3.97-SNAPSHOT/ (to which /srv/wdqs/ is a symlink). RWStore.properties is edited to adjust the location of the journal file, which we have in /var/lib/wdqs/factgrid.jnl; mwservices.conf is edited to add database.factgrid.de to the allowed MWAPI endpoints; whitelist.txt is added to allow SPARQL federation with the following endpoints:
- WDQS (SERVICE <https://query.wikidata.org/sparql> { ... })
- DBpedia (SERVICE <https://dbpedia.org/sparql> { ... })
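For illustration, a whitelist.txt allowing the two federation endpoints listed above could be written like this. It is only a sketch under two assumptions: that the file holds one allowed endpoint URL per line (as the whitelist shipped with upstream does), and that it belongs in the query service install directory; the demo writes to a scratch directory instead of /srv/wdqs/.

```shell
#!/bin/sh
# Assumption: whitelist.txt holds one allowed SPARQL endpoint URL per line.
# The scratch directory stands in for the real install directory.
wdqs_dir=$(mktemp -d)

# Quoted heredoc delimiter: the URLs are written literally, without expansion.
cat > "$wdqs_dir/whitelist.txt" <<'EOF'
https://query.wikidata.org/sparql
https://dbpedia.org/sparql
EOF

cat "$wdqs_dir/whitelist.txt"
```

Blazegraph presumably has to be restarted (systemctl restart blazegraph) before a changed list takes effect.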
The query service itself runs as the blazegraph.service systemd unit (run systemctl cat blazegraph to see the configuration files). Its standard output and error go to the journal, and can be viewed by administrators with journalctl -u blazegraph (add -e for the latest messages).
Apache2 is configured (/etc/apache2/sites-available/001-factgrid-ssl.conf) to forward requests to /sparql to Blazegraph. It adds Blazegraph-specific request headers to enforce a max query time (60 seconds) and read-only mode, and an Access-Control-Allow-Origin response header to allow client-side JavaScript code to read query responses without restrictions.
The updater for the query service, which reads updates from the wiki’s recent changes and applies them to the query service, similarly runs as blazegraph-update.service.
The query service UI is cloned in ~factgrid/wikidata-query-gui/. It can be built using npm run build, and the resulting build/ directory is then copied into /var/www/, with a symlink /var/www/query pointing to the latest version. A few of the files in the repository have uncommitted changes specific to FactGrid; before updating the GUI, they have to be stashed away.
git stash save &&
git pull &&
git stash pop &&
npm install &&
npm run build &&
cp -a custom-config.json factgrid.png build/ &&
now=$(date -Iseconds) &&
cp -a build/ /var/www/query-"$now" &&
ln -sfT query-"$now" /var/www/query # atomically update symlink
# optional: remove the old /var/www/query-* directory
Dumps
dump-json.service creates a gzip-compressed JSON dump in /srv/dumps/, named after the current date (ISO 8601 format). dump-json.timer runs that service each day at 21:00 (CET). /srv/dumps/ is symlinked into /var/www/ (i.e. https://database.factgrid.de/dumps/); systemd-tmpfiles-clean.service, configured via /etc/tmpfiles.d/dumps.conf, removes dumps after 90 days.
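The 90-day retention is handled by systemd-tmpfiles on the server; the following is only a rough shell stand-in that shows the effect of such a rule, using find instead. The helper name prune_dumps and the dump file name pattern are hypothetical, the scratch directory stands in for /srv/dumps/, and find only looks at the modification time, so the age check is not identical to what systemd-tmpfiles does.

```shell
#!/bin/sh
# Hypothetical helper: delete regular files directly inside DUMP_DIR
# whose last modification is more than MAX_DAYS days in the past.
prune_dumps() {
    dump_dir=$1 max_days=$2
    find "$dump_dir" -maxdepth 1 -type f -mtime +"$max_days" -delete
}

# Demo in a scratch directory (assumption: stands in for /srv/dumps/).
demo_dumps=$(mktemp -d)
today=$(date -I)                      # ISO 8601 date, e.g. 2024-08-17
touch "$demo_dumps/$today.json.gz"    # file name pattern is an assumption
touch -d '100 days ago' "$demo_dumps/old.json.gz"   # GNU touch: backdate a file

prune_dumps "$demo_dumps" 90
ls "$demo_dumps"                      # only today's dump should remain
```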
Reconciliation service
An instance of the openrefine-wikibase service is installed in /home/factgrid/openrefine-wikibase/, with dependencies in a venv under .venv/ and configuration in config.py. (Prior to the upgrade to Debian 11 / Bullseye, it used a locally built Python 3.9.9 with sources in /home/factgrid/Python-3.9.9/, installed using make altinstall under prefix /usr/local/; this old Python is mostly still around, because Python doesn’t provide a make uninstall command, but it’s no longer used, and I manually renamed the /usr/local/bin files to avoid confusion. Several hacks (https://gist.github.com/lucaswerkmeister/3ae63110c3869204db1dae26af23814c) are required to make the code run under Python 3.11.)
openrefine-wikibase.service runs the service on localhost, port 8000; Apache is configured to proxy https://database.factgrid.de/reconcile/ to this service, which means the actual reconciliation service URL to configure in OpenRefine is https://database.factgrid.de/reconcile/en/api, or https://database.factgrid.de/reconcile/de/api for German labels/descriptions.
A Wikibase manifest for OpenRefine is available at https://database.factgrid.de/factgrid-manifest.json.
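The URL pattern above can be captured in a small helper. This is a sketch; the function name reconcile_url is hypothetical, and only the language codes en and de are mentioned in this page.

```shell
#!/bin/sh
# Hypothetical helper: build the reconciliation endpoint URL for a language code.
reconcile_url() {
    printf 'https://database.factgrid.de/reconcile/%s/api\n' "$1"
}

reconcile_url en   # English labels/descriptions
reconcile_url de   # German labels/descriptions

# With network access, fetching the endpoint is a quick way to check that the
# Apache proxy and the service are up:
# curl -sS "$(reconcile_url en)"
```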
ElasticSearch
ElasticSearch is installed via the 7.10.2 .deb package (https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-oss-7.10.2-amd64.deb), with the org.wikimedia.search:extra:7.10.2-wmf4 and org.wikimedia.search.highlighter:experimental-highlighter-elasticsearch-plugin:7.10.2 plugins installed via /usr/share/elasticsearch/bin/elasticsearch-plugin install name:version.
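Spelled out for the two plugins named above, the install commands would look roughly like this. The sketch only prints the commands (a dry run), since actually running them needs root on a host with ElasticSearch installed; the function name install_plugins is hypothetical.

```shell
#!/bin/sh
# Path and plugin coordinates are taken from the text above.
plugin_tool=/usr/share/elasticsearch/bin/elasticsearch-plugin

# Print (rather than run) one install command per plugin;
# drop the "echo" to actually install.
install_plugins() {
    for plugin in \
        org.wikimedia.search:extra:7.10.2-wmf4 \
        org.wikimedia.search.highlighter:experimental-highlighter-elasticsearch-plugin:7.10.2
    do
        echo "$plugin_tool" install "$plugin"
    done
}

install_plugins
```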
CirrusSearch and WikibaseCirrusSearch are installed, mainly according to the CirrusSearch README; note that $wgWBCSUseCirrus must already be true when the search index is initialized. $wgWBRepoSettings['searchIndexTypes'] lists the same property data types to index for haswbstatement search as in production: string, external-id, url, wikibase-item, wikibase-property, wikibase-lexeme, wikibase-form, wikibase-sense.