Instructions for Admin¶
Installation on Ubuntu 20.04¶
Run the following instructions as root
(as administrator).
If a file needs to be opened or modified, use the editor “nano” (type nano
followed by the filename).
Start root bash:
$ sudo bash
Install the Ubuntu packages:
$ apt update
$ apt install -y git docker.io docker-compose
Clone this repository to a desired folder (here /opt) and enter the docker folder:
$ cd /opt
$ git clone https://sissource.ethz.ch/sispub/datapool2/datapool.git
$ cd datapool/docker
$ cp .env.example .env
Set up folders for the home of the development landing zones, the operational landing zone, the backup landing zone, and extra scripts you might want to have accessible for the datapool, e.g.:
$ mkdir -p /data/landing_zone
$ mkdir -p /data/backup_landing_zone
$ mkdir -p /data/development
$ mkdir -p /data/extra
Beyond that, also create folders for the database files and the datapool log files:
$ mkdir -p /data/postgres
$ mkdir -p /data/logs
These locations can be adapted as you wish.
Adjust the settings for this installation in the
.env
file.
POSTGRES_PASSWORD
is the password for the postgres role in the dockerized database.
DATAPOOL_PASSWORD
is the password for the datapool role in the dockerized database.
DATAPOOL_READER_PASSWORD
is the password for the datapool-reader role in the dockerized database.
POSTGRES_PORT
is the port for the postgres db on the host (default is 5432).
EXTRA_DATA
is designed to store additional scripts that are supposed to be accessible for conversion scripts.
LANDING_ZONE
is the root folder of the landing zone on the host server.
BACKUP_LANDING_ZONE
is the root folder of the backup landing zone on the host server.
DEVELOPMENT_LANDING_ZONES_ROOT
is the folder on the host server which will be used for development landing zones. Contrary to a manual setup, development landing zones cannot be created at arbitrary locations on the host machine, but must be located within DEVELOPMENT_LANDING_ZONES_ROOT.
HTTP_PORT
is the port to query status, logs and metrics of the datapool server.
CPU_ARCHITECTURE
specifies the architecture of the docker containers to use (beta).
Here’s an example:
# postgres access configuration
POSTGRES_PASSWORD=datapoolrocks666
POSTGRES_PORT=5432
# database user passwords
DATAPOOL_PASSWORD=corona666
DATAPOOL_READER_PASSWORD=switzerland2021
# landing zone folder on host
LANDING_ZONE=/data/landing_zone
# backup landing zone folder on host
BACKUP_LANDING_ZONE=/data/backup_landing_zone
# folder hosting development landing zones
DEVELOPMENT_LANDING_ZONES_ROOT=/data/development
# API port
HTTP_PORT=8000
# extra files
EXTRA_DATA=/data/extra
# database data location
POSTGRES_DB_HOST=/data/postgres/data
# persist log files
LOG_FILES=/data/logs
# internal
CPU_ARCHITECTURE=amd64
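A misspelled or forgotten key in .env is easy to overlook. The sketch below is a hypothetical helper (not part of the datapool repository) that parses KEY=VALUE lines and reports any expected key that is missing:

```python
# check_env.py -- sanity-check a .env file against the keys the
# installation above expects (hypothetical helper, not shipped with datapool)

REQUIRED_KEYS = {
    "POSTGRES_PASSWORD", "POSTGRES_PORT",
    "DATAPOOL_PASSWORD", "DATAPOOL_READER_PASSWORD",
    "LANDING_ZONE", "BACKUP_LANDING_ZONE",
    "DEVELOPMENT_LANDING_ZONES_ROOT",
    "HTTP_PORT", "EXTRA_DATA",
    "POSTGRES_DB_HOST", "LOG_FILES", "CPU_ARCHITECTURE",
}

def parse_env(text):
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings

def missing_keys(settings):
    """Return the expected keys that are absent from the parsed settings."""
    return sorted(REQUIRED_KEYS - settings.keys())
```

Running `missing_keys(parse_env(open(".env").read()))` should return an empty list for a complete configuration.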
Change the settings for EXTRA_DATA, LANDING_ZONE, BACKUP_LANDING_ZONE and DEVELOPMENT_LANDING_ZONES_ROOT in .env accordingly. Then (up to personal modifications):
$ chgrp docker /data/development
$ chmod g+ws /data/development
$ chgrp docker /data/landing_zone
$ chmod g+ws /data/landing_zone
$ chgrp docker /data/backup_landing_zone
$ chmod g+ws /data/backup_landing_zone
$ chgrp docker /data/extra
$ chmod g+ws /data/extra
Build the containers:
$ docker-compose build
To set up the service so that it also restarts after reboot:
$ cp datapool.service /etc/systemd/system/
$ systemctl enable datapool
$ systemctl start datapool
$ systemctl status datapool
To install the pool command:
$ cp pool /usr/local/bin
$ chgrp docker /usr/local/bin/pool
$ chmod o-x /usr/local/bin/pool
To allow user USER-ABC to use the pool command:
$ usermod -aG docker USER-ABC
The user USER-ABC should now be able to run pool commands after login.
Update Datapool¶
Change the directory to the datapool repository:
$ cd /opt/datapool
Start root bash:
$ sudo bash
Update the datapool repository:
$ git pull
Enter the docker folder:
$ cd docker
Rebuild the docker containers:
$ docker-compose build
Restart the datapool service and check if it runs:
$ systemctl restart datapool
$ systemctl status datapool
To check datapool service logs:
$ journalctl -u datapool -f
To check docker logs:
$ docker-compose logs -f
Access to log files¶
If there is an error message on the command line, it is helpful to look into the log files for further explanation of the error. The easiest way is to use the journalctl system service.
$ sudo journalctl -u datapool --since yesterday | grep error
-u filters for everything related to datapool. With --since a timeframe can be specified, and with | grep a specific string can be searched for. For more information use the help function.
$ journalctl --help
Further Tips¶
Accessing a docker container (opening bash in a container, e.g. docker_datapool_1):
$ docker exec -it docker_datapool_1 bash
Accessing database in container (docker_postgis_1) via psql:
$ docker exec -it docker_postgis_1 psql -U postgres datapool
Taking care of permission issues for a folder X:
$ chgrp -R docker folder_X
$ chmod -R g+rwx folder_X
Clean Up tmp Files in Datapool Container¶
When a conversion script runs, it writes an output file into the temporary folder of the datapool docker container. If the conversion is unsuccessful, the tmp file is not deleted afterwards. This way a lot of data can accumulate in the /tmp folder of the datapool container.
To free up space, access the container and delete the temporary files.
$ docker exec -it docker_datapool_1 bash
(docker_datapool_1)$ rm -rf /tmp/*
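If you would rather keep recent tmp files (e.g. those of currently running conversions) and only delete old ones, a small script run inside the container can apply an age cutoff. A sketch, with an arbitrary 7-day threshold:

```python
import time
from pathlib import Path

def remove_old_files(folder, max_age_days=7, now=None):
    """Delete regular files in `folder` whose modification time is older
    than `max_age_days` days; return the paths that were removed."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for path in Path(folder).iterdir():
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(str(path))
    return sorted(removed)
```

Inside the container this would be called as `remove_old_files("/tmp")`.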
Backup and Restore Database¶
Creating a compressed backup:
$ sudo -u postgres pg_dump datapool | gzip > datapool_2021_06_23.gz
Restoring database from backup:
$ gunzip -c filename.gz | sudo -u postgres psql dbname
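If backups are taken regularly, it helps to date-stamp them the same way every time. A tiny hypothetical helper that reproduces the naming convention of the example above and builds the corresponding shell pipeline:

```python
from datetime import date

def backup_name(dbname, day=None):
    """Return a dated backup filename like 'datapool_2021_06_23.gz'."""
    day = day if day is not None else date.today()
    return f"{dbname}_{day:%Y_%m_%d}.gz"

def backup_command(dbname, day=None):
    """Build the pg_dump pipeline from the instructions above."""
    return f"sudo -u postgres pg_dump {dbname} | gzip > {backup_name(dbname, day)}"
```

Such a helper could be called from a daily cron job so backup filenames stay sortable and unambiguous.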
Adding Packages to R, Python, Julia¶
To install additional packages to be used in conversion scripts, please adapt the files in the docker/datapool folder. There is a file for each language that lists the packages to be installed.
After adding desired packages run (in the docker folder):
$ docker-compose build && systemctl restart datapool
Migrating Datapool_1 to Datapool_2¶
Important
The following instructions are rather an example than a strict guideline! Depending on your system you might need to adapt details.
In order for the example scripts to work, ssh keys on both servers are required.
For the following instructions it is assumed the Datapool_2 program is installed on a separate server (DP2).
Warning
The special value definition table and site images are not considered in the example.
It is recommended to save backups of the old Datapool_1 database and the backup_landing_zone.
Initially we will copy the database table signal (DP1).
Create a backup of the signal table. (Using your database user password.)
$ sudo su postgres
$ pg_dump -h localhost -U datapool -C -t signal -d datapool > signal_dump.sql
Copy the dumped table to the new server (DP2), into the directory where the PostgreSQL database of the postgis docker container is mounted. If you followed the installation instructions described above, this is the folder:
/data/postgres/data
$ sudo rsync -v -als -e ssh user@dp1-server:/path/to/backup/signal_dump.sql /path/postgis/docker/data/mount/
Restore signal table in the new database (DP2)
$ # access the docker container
$ sudo docker exec -it docker_postgis_1 bash
$ # remove the foreign key constraints
$ psql -U datapool -c "ALTER TABLE signal DROP CONSTRAINT signal_parameter_id_fkey;"
$ psql -U datapool -c "ALTER TABLE picture DROP CONSTRAINT picture_site_id_fkey;"
$ psql -U datapool -c "ALTER TABLE signal DROP CONSTRAINT signal_site_id_fkey;"
$ psql -U datapool -c "ALTER TABLE signal DROP CONSTRAINT signal_source_id_fkey;"
$ # restore the table (inside the docker container)
$ psql -h localhost -U datapool -d datapool < /var/lib/postgresql/data/signal_dump.sql
Create a development landing zone on the (DP2) server.
$ pool add dlz
Install dependencies for the usage of later scripts:
$ pip install pandas sqlalchemy psycopg2-binary
Create source_types, sources, parameters and sites via the pool add command.
As some tables have changed from DP1 to DP2, we need to create the respective entities again with slight modifications. In order to do so, we read the information from the DP1 database and create pool add commands from these values.
Preview or download the notebook handling the pool add command generation.
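As a rough illustration of what that notebook does, the sketch below turns rows fetched from the DP1 database into pool add command lines. The column names and the exact pool add argument syntax are assumptions here and must be adapted to your schema (check pool add --help for the real interface):

```python
def make_pool_add_commands(entity, rows, columns):
    """Render one 'pool add' invocation per DP1 row.

    entity  -- e.g. 'site', 'parameter' or 'source'
    rows    -- list of dicts as fetched from the DP1 database
    columns -- which fields to pass along, in order

    NOTE: the 'pool add <entity> <values...>' syntax used here is an
    assumption; consult the pool command's help for the actual arguments.
    """
    commands = []
    for row in rows:
        values = " ".join(str(row[c]) for c in columns)
        commands.append(f"pool add {entity} {values}")
    return commands
```

The generated command strings can then be written to a shell script and reviewed before execution on the DP2 server.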
Update the live datapool landing zone.
Change directory to the datapool development directory you configured in the installation.
$ cd /YOUR/DEVELOPMENT/DIRECTORY
Check for problems:
$ pool check dlz
Update the operational landing zone:
$ pool update-operational dlz
Fixing entity ids for sources, parameters and sites.
The signal table we copied in step 1 contains the ids from DP1. After creating all entities again on DP2, the current ids do not match. In order to fix them, we match the old ids to the new ones and update the tables. Updating the tables is done in two steps:
Introduce an offset to all current ids. (default offset is 1000)
Map the offset ids to the ones from the DP1.
Preview or download the notebook handling correction of the table ids.
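The two-step update above can be sketched as pure functions: first shift every current id by an offset so no intermediate id collides with a target id, then map the shifted ids onto the DP1 ids. The names are illustrative, not taken from the notebook:

```python
OFFSET = 1000  # the default offset mentioned above

def shift_ids(ids, offset=OFFSET):
    """Step 1: move all current ids out of the way by adding the offset."""
    return [i + offset for i in ids]

def remap_ids(shifted_ids, mapping, offset=OFFSET):
    """Step 2: replace each shifted id with the matching DP1 id.

    mapping -- {current_id_before_offset: dp1_id}, e.g. built by joining
    the entity tables of both databases on their natural keys.
    """
    return [mapping[i - offset] for i in shifted_ids]
```

The offset matters because the old and new id ranges overlap: updating id 1 directly to 2 while another row still holds id 2 would conflate two entities, whereas ids in the 1000+ range are guaranteed not to clash with the DP1 targets.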
Create another development landing zone.
$ pool add adding_conversion_scripts
Add conversion scripts and raw files from the DP1.
Reinstate foreign key constraints.
$ # access the docker container
$ sudo docker exec -it docker_postgis_1 bash
$ # reinstate the foreign key constraints
$ psql -U datapool -c "ALTER TABLE signal ADD CONSTRAINT signal_parameter_id_fkey FOREIGN KEY (parameter_id) REFERENCES parameter(parameter_id);"
$ psql -U datapool -c "ALTER TABLE signal ADD CONSTRAINT signal_site_id_fkey FOREIGN KEY (site_id) REFERENCES site(site_id);"
$ psql -U datapool -c "ALTER TABLE signal ADD CONSTRAINT signal_source_id_fkey FOREIGN KEY (source_id) REFERENCES source(source_id);"
$ psql -U datapool -c "ALTER TABLE picture ADD CONSTRAINT picture_site_id_fkey FOREIGN KEY (site_id) REFERENCES site(site_id);"
Check data.
One way to check whether the migration was successful is to check all entity tables including ids (source, parameter, site) and the mapping of source_types to sources. Finally, the signal table must be checked. That can take a few hours, depending on how much data is present. You could check the signal table by retrieving all data for each source from both instances and verifying that it is the same data.
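The per-source comparison just described can be sketched as follows; fetching the rows from both instances (e.g. with pandas/sqlalchemy, as installed earlier) is left out, and only the comparison of two already-fetched result sets is shown:

```python
def compare_signals(rows_dp1, rows_dp2):
    """Compare two result sets of signal rows regardless of row order.

    rows_dp1 / rows_dp2 -- sequences of hashable row tuples,
    e.g. (timestamp, value, parameter_id, site_id).
    Returns (only_in_dp1, only_in_dp2) as sorted lists; both empty
    means the two instances agree for this source.
    """
    set1, set2 = set(rows_dp1), set(rows_dp2)
    return sorted(set1 - set2), sorted(set2 - set1)
```

Looping this function over all sources and logging any non-empty differences gives a complete, if slow, consistency check of the migrated signal table.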