Instructions for Admin¶
Installation on Ubuntu 20.04¶
Run the following instructions as root
(as administrator).
If a file needs to be opened or modified, use the editor “nano” (type nano
followed by the filename).
Start root bash:
$ sudo bash
Install the Ubuntu packages:
$ apt update
$ apt install -y git docker.io docker-compose
Clone this repository to a desired folder (here /opt) and enter the docker folder:
$ cd /opt
$ git clone https://sissource.ethz.ch/sispub/datapool2/datapool.git
$ cd datapool/docker
$ cp .env.example .env
Set up folders for the home of the development landing zones, the operational landing zone, the backup landing zone, and extra scripts you might want to have accessible for the datapool, e.g.:
$ mkdir -p /data/landing_zone
$ mkdir -p /data/backup_landing_zone
$ mkdir -p /data/development
$ mkdir -p /data/extra
Beyond that, also create folders for the database files and the datapool log files:
$ mkdir -p /data/postgres
$ mkdir -p /data/logs
These locations can be adapted as you wish.
Adjust the settings for this installation in the
.env
file.
POSTGRES_PASSWORD
is the password for the postgres role in the dockerized database.
DATAPOOL_PASSWORD
is the password for the datapool role in the dockerized database.
DATAPOOL_READER_PASSWORD
is the password for the datapool-reader role in the dockerized database.
POSTGRES_PORT
is the port for the postgres db on the host (default is 5432).
EXTRA_DATA
is designed to store additional scripts that are supposed to be accessible for conversion scripts.
LANDING_ZONE
is the root folder of the landing zone on the host server.
BACKUP_LANDING_ZONE
is the root folder of the backup landing zone on the host server.
DEVELOPMENT_LANDING_ZONES_ROOT
is the folder on the host server which will be used for development landing zones. Contrary to a manual setup, development landing zones cannot be created at arbitrary locations on the host machine, but must be located within DEVELOPMENT_LANDING_ZONES_ROOT.
HTTP_PORT
is the port to query status, logs and metrics of the datapool server.
CPU_ARCHITECTURE
specifies the architecture of the docker containers to use (beta).
Here’s an example:
# postgres access configuration
POSTGRES_PASSWORD=datapoolrocks666
POSTGRES_PORT=5432
# database user passwords
DATAPOOL_PASSWORD=corona666
DATAPOOL_READER_PASSWORD=switzerland2021
# landing zone folder on host
LANDING_ZONE=/data/landing_zone
# backup landing zone folder on host
BACKUP_LANDING_ZONE=/data/backup_landing_zone
# folder hosting development landing zones
DEVELOPMENT_LANDING_ZONES_ROOT=/data/development
# API port
HTTP_PORT=8000
# extra files
EXTRA_DATA=/data/extra
# database data location
POSTGRES_DB_HOST=/data/postgres/data
# persist log files
LOG_FILES=/data/logs
# internal
CPU_ARCHITECTURE=amd64
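A misspelled or forgotten key in .env is easy to overlook. The sketch below is a hypothetical helper (not part of the datapool repository) that parses KEY=VALUE lines and reports any expected key that is missing:

```python
# check_env.py -- sanity-check a .env file against the keys the
# installation above expects (hypothetical helper, not shipped with datapool)

REQUIRED_KEYS = {
    "POSTGRES_PASSWORD", "POSTGRES_PORT",
    "DATAPOOL_PASSWORD", "DATAPOOL_READER_PASSWORD",
    "LANDING_ZONE", "BACKUP_LANDING_ZONE",
    "DEVELOPMENT_LANDING_ZONES_ROOT",
    "HTTP_PORT", "EXTRA_DATA",
    "POSTGRES_DB_HOST", "LOG_FILES", "CPU_ARCHITECTURE",
}

def parse_env(text):
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings

def missing_keys(settings):
    """Return the expected keys that are absent from the parsed settings."""
    return sorted(REQUIRED_KEYS - settings.keys())
```

Running `missing_keys(parse_env(open(".env").read()))` should return an empty list for a complete configuration.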
Change the settings for EXTRA_DATA, LANDING_ZONE, BACKUP_LANDING_ZONE and DEVELOPMENT_LANDING_ZONES_ROOT in .env accordingly. Then (up to personal modifications):
$ chgrp docker /data/development
$ chmod g+ws /data/development
$ chgrp docker /data/landing_zone
$ chmod g+ws /data/landing_zone
$ chgrp docker /data/backup_landing_zone
$ chmod g+ws /data/backup_landing_zone
$ chgrp docker /data/extra
$ chmod g+ws /data/extra
Build the containers:
$ docker-compose build
To set up the service so that it also restarts after reboot:
$ cp datapool.service /etc/systemd/system/
$ systemctl enable datapool
$ systemctl start datapool
$ systemctl status datapool
To install the pool command:
$ cp pool /usr/local/bin
$ chgrp docker /usr/local/bin/pool
$ chmod o-x /usr/local/bin/pool
To allow user USER-ABC to use the pool command:
$ usermod -aG docker USER-ABC
The user USER-ABC should now be able to run pool commands after login.
Update Datapool¶
Change the directory to the datapool repository:
$ cd /opt/datapool
Start root bash:
$ sudo bash
Update the datapool repository:
$ git pull
Enter the docker folder:
$ cd docker
Rebuild the docker containers:
$ docker-compose build
Restart the datapool service and check if it runs:
$ systemctl restart datapool
$ systemctl status datapool
To check datapool service logs:
$ journalctl -u datapool -f
To check docker logs:
$ docker-compose logs -f
Access to log files¶
If there is an error message on the command line, it is helpful to look into the log files for further explanation of the error. The easiest way is to use the journalctl system service.
$ sudo journalctl -u datapool --since yesterday | grep error
-u filters for everything related to datapool. With --since a timeframe can be specified, and with | grep a specific string can be searched for. For more information use the help function.
$ journalctl --help
Further Tips¶
Accessing a docker container (opening bash in a container, e.g. docker_datapool_1):
$ docker exec -it docker_datapool_1 bash
Accessing database in container (docker_postgis_1) via psql:
$ docker exec -it docker_postgis_1 psql -U postgres datapool
Taking care of permission issues for a folder X:
$ chgrp -R docker folder_X
$ chmod -R g+rwx folder_X
Clean Up tmp Files in Datapool Container¶
When a conversion script runs, it writes an output file into the temporary folder of the datapool docker container. If the conversion is unsuccessful, the tmp file is not deleted afterwards. This way a lot of data can accumulate in the /tmp folder of the datapool container.
To free up space, access the container and delete the temporary files.
$ docker exec -it docker_datapool_1 bash
(docker_datapool_1)$ rm -rf /tmp/*
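If you would rather keep recent tmp files (e.g. those of currently running conversions) and only delete old ones, a small script run inside the container can apply an age cutoff. A sketch, with an arbitrary 7-day threshold:

```python
import time
from pathlib import Path

def remove_old_files(folder, max_age_days=7, now=None):
    """Delete regular files in `folder` whose modification time is older
    than `max_age_days` days; return the paths that were removed."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for path in Path(folder).iterdir():
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(str(path))
    return sorted(removed)
```

Inside the container this would be called as `remove_old_files("/tmp")`.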
Backup and Restore Database¶
Creating a compressed backup:
$ sudo -u postgres pg_dump datapool | gzip > datapool_2021_06_23.gz
Restoring database from backup:
$ gunzip -c filename.gz | sudo -u postgres psql dbname
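If backups are taken regularly, it helps to date-stamp them the same way every time. A tiny hypothetical helper that reproduces the naming convention of the example above and builds the corresponding shell pipeline:

```python
from datetime import date

def backup_name(dbname, day=None):
    """Return a dated backup filename like 'datapool_2021_06_23.gz'."""
    day = day if day is not None else date.today()
    return f"{dbname}_{day:%Y_%m_%d}.gz"

def backup_command(dbname, day=None):
    """Build the pg_dump pipeline from the instructions above."""
    return f"sudo -u postgres pg_dump {dbname} | gzip > {backup_name(dbname, day)}"
```

Such a helper could be called from a daily cron job so backup filenames stay sortable and unambiguous.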
Adding Packages to R, Python, Julia¶
To install additional packages to be used in conversion scripts, please adapt the files in the docker/datapool folder. There is a file for each language that lists the packages to be installed.
After adding desired packages run (in the docker folder):
$ docker-compose build && systemctl restart datapool
Migrating Datapool_1 to Datapool_2¶
Important
The following instructions are rather an example than a strict guideline! Depending on your system you might need to adapt details.
In order for the example scripts to work, ssh keys on both servers are required.
For the following instructions it is assumed the Datapool_2 program is installed on a separate server (DP2).
Warning
The special value definition table and site images are not considered in the example.
It is recommended to save backups of the old Datapool_1 database and the backup_landing_zone.
Initially we will copy the database table signal (DP1).
Create a backup of the signal table. (Using your database user password.)
$ sudo su postgres
$ pg_dump -h localhost -U datapool -C -t signal -d datapool > signal_dump.sql
Copy the dumped table to the new server (DP2), into the directory where the PostgreSQL database of the postgis docker container is mounted. If you followed the installation instructions described above, this is the folder:
/data/postgres/data
$ sudo rsync -v -als -e ssh user@dp1-server:/path/to/backup/signal_dump.sql /path/postgis/docker/data/mount/
Restore signal table in the new database (DP2)
$ # access the docker container
$ sudo docker exec -it docker_postgis_1 bash
$ # remove the foreign key constraints
$ psql -U datapool -c "ALTER TABLE signal DROP CONSTRAINT signal_parameter_id_fkey;"
$ psql -U datapool -c "ALTER TABLE picture DROP CONSTRAINT picture_site_id_fkey;"
$ psql -U datapool -c "ALTER TABLE signal DROP CONSTRAINT signal_site_id_fkey;"
$ psql -U datapool -c "ALTER TABLE signal DROP CONSTRAINT signal_source_id_fkey;"
$ # restore the table (inside the docker container)
$ psql -h localhost -U datapool -d datapool < /var/lib/postgresql/data/signal_dump.sql
Create a development landing zone on the (DP2) server.
$ pool add dlz
Install dependencies for the usage of later scripts:
$ pip install pandas sqlalchemy psycopg2-binary
Create source_types, sources, parameters and sites via the pool add command.
As some tables have changed from DP1 to DP2, we need to create the respective entities again with slight modifications. In order to do so, we read the information from the DP1 database and create pool add commands from these values.
Preview or download the notebook handling the pool add command generation.
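As a rough illustration of what that notebook does, the sketch below turns rows fetched from the DP1 database into pool add command lines. The column names and the exact pool add argument syntax are assumptions here and must be adapted to your schema (check pool add --help for the real interface):

```python
def make_pool_add_commands(entity, rows, columns):
    """Render one 'pool add' invocation per DP1 row.

    entity  -- e.g. 'site', 'parameter' or 'source'
    rows    -- list of dicts as fetched from the DP1 database
    columns -- which fields to pass along, in order

    NOTE: the 'pool add <entity> <values...>' syntax used here is an
    assumption; consult the pool command's help for the actual arguments.
    """
    commands = []
    for row in rows:
        values = " ".join(str(row[c]) for c in columns)
        commands.append(f"pool add {entity} {values}")
    return commands
```

The generated command strings can then be written to a shell script and reviewed before execution on the DP2 server.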
Update the live datapool landing zone.
Change directory to the datapool development directory you configured in the installation.
$ cd /YOUR/DEVELOPMENT/DIRECTORY
Check for problems:
$ pool check dlz
Update the operational landing zone:
$ pool update-operational dlz
Fixing entity ids for sources, parameters and sites.
The signal table we copied in step 1 contains the ids from DP1. After creating all entities again on DP2, the current ids do not match. In order to fix them, we match the old ids to the new ones and update the tables. Updating the tables is done in two steps:
Introduce an offset to all current ids. (default offset is 1000)
Map the offset ids to the ones from the DP1.
Preview or download the notebook handling correction of the table ids.
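The two-step update above can be sketched as pure functions: first shift every current id by an offset so no intermediate id collides with a target id, then map the shifted ids onto the DP1 ids. The names are illustrative, not taken from the notebook:

```python
OFFSET = 1000  # the default offset mentioned above

def shift_ids(ids, offset=OFFSET):
    """Step 1: move all current ids out of the way by adding the offset."""
    return [i + offset for i in ids]

def remap_ids(shifted_ids, mapping, offset=OFFSET):
    """Step 2: replace each shifted id with the matching DP1 id.

    mapping -- {current_id_before_offset: dp1_id}, e.g. built by joining
    the entity tables of both databases on their natural keys.
    """
    return [mapping[i - offset] for i in shifted_ids]
```

The offset matters because the old and new id ranges overlap: updating id 1 directly to 2 while another row still holds id 2 would conflate two entities, whereas ids in the 1000+ range are guaranteed not to clash with the DP1 targets.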
Create another development landing zone.
$ pool add adding_conversion_scripts
Add conversion scripts and raw files from the DP1.
Reinstate foreign key constraints.
$ # access the docker container
$ sudo docker exec -it docker_postgis_1 bash
$ # reinstate the foreign key constraints
$ psql -U datapool -c "ALTER TABLE signal ADD CONSTRAINT signal_parameter_id_fkey FOREIGN KEY (parameter_id) REFERENCES parameter(parameter_id);"
$ psql -U datapool -c "ALTER TABLE signal ADD CONSTRAINT signal_site_id_fkey FOREIGN KEY (site_id) REFERENCES site(site_id);"
$ psql -U datapool -c "ALTER TABLE signal ADD CONSTRAINT signal_source_id_fkey FOREIGN KEY (source_id) REFERENCES source(source_id);"
$ psql -U datapool -c "ALTER TABLE picture ADD CONSTRAINT picture_site_id_fkey FOREIGN KEY (site_id) REFERENCES site(site_id);"
Check data.
One way to check whether the migration was successful is to check all entity tables including ids (source, parameter, site) and the mapping of source_types to sources. Finally, the signal table must be checked. That can take a few hours, depending on how much data is present. You could check the signal table by retrieving all data for each source from both instances and verifying that it is the same data.
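The per-source comparison just described can be sketched as follows; fetching the rows from both instances (e.g. with pandas/sqlalchemy, as installed earlier) is left out, and only the comparison of two already-fetched result sets is shown:

```python
def compare_signals(rows_dp1, rows_dp2):
    """Compare two result sets of signal rows regardless of row order.

    rows_dp1 / rows_dp2 -- sequences of hashable row tuples,
    e.g. (timestamp, value, parameter_id, site_id).
    Returns (only_in_dp1, only_in_dp2) as sorted lists; both empty
    means the two instances agree for this source.
    """
    set1, set2 = set(rows_dp1), set(rows_dp2)
    return sorted(set1 - set2), sorted(set2 - set1)
```

Looping this function over all sources and logging any non-empty differences gives a complete, if slow, consistency check of the migrated signal table.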