Obsolete documentation¶
This material should be worked into the other sections.
Run server in test mode¶
The following sequence initializes datapool and runs the server in single-process mode.
$ rm -rf ./lz 2>/dev/null
$ export ETC=./etc
$ rm -rf $ETC 2>/dev/null
$ pool init-config --use-sqlitedb ./lz
$ pool init-db
$ pool check-config
$ pool run-simple-server
Usually pool init-config would write to /etc/datapool, so the command requires root privileges. Setting the environment variable ETC overrides the /etc folder so we do not interfere with a global setup.
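The lookup can be pictured as follows. This is a minimal sketch, not datapool's actual implementation; the helper name `config_dir` is invented for illustration, and we only assume the documented behavior: use $ETC when set, fall back to /etc otherwise.

```python
import os


def config_dir(env=os.environ):
    """Hypothetical sketch: the folder datapool.ini would live in.

    $ETC replaces the /etc prefix when set, so a local test setup
    never touches the global /etc/datapool configuration.
    """
    base = env.get("ETC", "/etc")
    return os.path.join(base, "datapool")
```

With `ETC=./etc` exported as above, the configuration lands in `./etc/datapool` instead of `/etc/datapool`.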
Further, we use --use-sqlitedb so that setting up and configuring a database system such as Postgres is not required. This flag is intended for testing; in operational mode we recommend avoiding it and configuring Postgres instead.
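Conceptually, the flag just selects which kind of database URL the configuration ends up with. The sketch below illustrates the two shapes; the helper `db_url` and the SQLite file name are invented for illustration (the Postgres URL shape matches the one shown by pool check-config below), not part of datapool's API.

```python
def db_url(use_sqlitedb, landing_zone="./lz"):
    """Hypothetical sketch of what --use-sqlitedb selects."""
    if use_sqlitedb:
        # one file on disk next to the landing zone, no database server needed
        return "sqlite:///{}/datapool.db".format(landing_zone)
    # a running Postgres server, configured in /etc/datapool/datapool.ini
    return "postgresql://user:password@localhost:5432/datapool"
```

The SQLite variant trades concurrency and robustness for zero setup, which is why it is fine for testing but not for an operational deployment.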
The final run-simple-server command observes changes to the operational landing zone at ./lz and reports its operations. The command does not run in the background, so it blocks the terminal until the user presses CTRL-C to shut it down.
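The observe-and-report behavior can be sketched as a polling loop. This is an illustration only: the function names are invented, and the real server may use a different change-detection mechanism entirely.

```python
import os
import time


def new_files(root, seen):
    """Return paths under root not yet in `seen`, updating `seen` in place."""
    found = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if path not in seen:
                seen.add(path)
                found.append(path)
    return found


def watch(root, interval=1.0):
    """Toy stand-in for run-simple-server: report new files until CTRL-C."""
    seen = set()
    new_files(root, seen)            # take the initial inventory
    while True:                      # blocks the terminal, like the real command
        time.sleep(interval)
        for path in new_files(root, seen):
            print("processing", path)
```

Pressing CTRL-C raises KeyboardInterrupt and ends the loop, which matches the documented way to stop the server.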
Acting as a data provider, we open another terminal window, set up a development landing zone, and commit the defaults to the operational landing zone. You should then see some output from the run-simple-server command in the previous terminal window:
$ rm -rf ./dlz 2>/dev/null
$ export ETC=./etc
$ pool start-develop dlz
$ pool check dlz
$ pool update-operational dlz
Workflow example¶
To initialize the datapool configuration on the current server, run the init-config subcommand. This might require admin permissions because the config file is stored in the /etc/datapool folder:
$ pool init-config ./lz
> init-config
- guess settings
- 'matlab' not found on $PATH
- created config files at /etc/datapool
please edit these files and adapt the data base configuration to your setup
+ initialized landing zone at ./lz
Then edit these files and run pool check-config:
$ pool check-config
> check-config
- check settings in config file /etc/datapool/datapool.ini
- try to connect to db
- could not connect to db postgresql://user:password@localhost:5432/datapool
- check R configuration + code execution
- matlab not configured, skip tests
- check julia configuration + code execution
- check julia version.
- check python configuration + code execution
+ all checks passed
To start development, create a so-called development landing zone, which can be an arbitrary folder:
$ pool start-develop ./dlz
> start-develop
- setup development landing zone
- operational landing zone is empty. create development landing zone with example files.
+ setup done
This copied some example .yaml files, conversion scripts, and raw data files. To check the scripts, run:
$ pool check-scripts ./dlz
> check-scripts
- check landing zone at ./dlz
- check ./dlz/data/sensor_from_company_xyz/sensor_instance_julia/conversion.jl
- wrote conversion result to /tmp/tmp9hcxslxv/sensor_instance_julia_0.csv
- wrote conversion result to /tmp/tmp9hcxslxv/sensor_instance_julia_0.txt
- check ./dlz/data/sensor_from_company_xyz/sensor_instance_python/conversion.py
- wrote conversion result to /tmp/tmp9hcxslxv/sensor_instance_python_0.csv
- wrote conversion result to /tmp/tmp9hcxslxv/sensor_instance_python_0.txt
- check ./dlz/data/sensor_from_company_xyz/sensor_instance_r/conversion.r
- wrote conversion result to /tmp/tmp9hcxslxv/sensor_instance_r_0.csv
- wrote conversion result to /tmp/tmp9hcxslxv/sensor_instance_r_0.txt
+ congratulations: checks succeeded.
This checked the scripts; you can inspect the result files at the paths displayed in the output.
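To give an idea of what such a conversion script does, here is a toy sketch: it reads a raw sensor file and writes rows to a CSV file, as the checked example scripts do. The raw format (semicolon-separated) and the column names are invented for illustration; consult the example conversion.py copied by pool start-develop for the interface datapool actually expects.

```python
import csv


def convert(raw_file, csv_file):
    """Toy conversion: read semicolon-separated raw lines, write CSV rows.

    Invented example format -- the real raw format depends on your sensor.
    """
    with open(raw_file) as src, open(csv_file, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["timestamp", "parameter", "value"])
        for line in src:
            fields = line.strip().split(";")
            if len(fields) == 3:   # skip malformed or empty lines
                writer.writerow(fields)
```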
To check the .yaml files:
$ pool check-yamls ./dlz/
> check-yamls
- check yamls in landing zone at ./dlz/
- setup fresh development db. productive does not exist or is empty.
- load and check 1 new yaml files:
- ./dlz/data/parameters.yaml
+ all yaml files checked
Now you can upload the changes from the development landing zone to the operational landing zone:
$ pool update-operational ./dlz
> update-operational
- check before copying files around.
- copied data/parameters.yaml
- copied data/sensor_from_company_xyz/sensor_instance_julia/conversion.jl
- copied data/sensor_from_company_xyz/sensor_instance_julia/raw_data/data-001.raw
- copied data/sensor_from_company_xyz/sensor_instance_matlab/raw_data/data-001.raw
- copied data/sensor_from_company_xyz/sensor_instance_python/conversion.py
- copied data/sensor_from_company_xyz/sensor_instance_python/raw_data/data-001.raw
- copied data/sensor_from_company_xyz/sensor_instance_r/conversion.r
- copied data/sensor_from_company_xyz/sensor_instance_r/raw_data/data-001.raw
- copied data/sensor_from_company_xyz/source_type.yaml
- copied sites/example_site/images/24G35_regenwetter.jpg
- copied sites/example_site/images/IMG_0312.JPG
- copied sites/example_site/images/IMG_0732.JPG
- copied sites/example_site/site.yaml
+ copied 13 files to ./lz
Deprecated (DP1 to DP1): Data Migration to another Datapool Instance¶
The following describes the procedure for migrating the data and the entire datapool setup from one server to another. The prerequisite is that the new server already has datapool installed and configured. We distinguish between the old instance (OI), where the data currently lives, and the new instance (NI), where the data is to be migrated. Each command below is flagged so it is easy to follow which command to run on which server.
Backup current Database (OI)
$ sudo pg_dump -U datapool -h localhost datapool > /path/to/backup/backup_todaysDate.sql
Create Development Landing Zone and move it to the Backup Folder (OI)
$ cd
$ pool start-develop dlz_todaysDate
$ mv ~/dlz_todaysDate /path/to/backup/
Copy Landing Zone Backup (OI)
To save storage volume on the new system, you might want to delete the files contained in all raw_data folders. Here we assume the datapool landing zone is in the directory /nfsmount.
$ cp -r /nfsmount/landing_zone_backup /path/to/backup/
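Emptying the raw_data folders before copying can be done with a small script. This is a hedged sketch (the function name is invented): it deletes only the files inside folders named raw_data and keeps the folders themselves, so the landing-zone layout stays intact.

```python
import os


def empty_raw_data_dirs(root):
    """Delete the files inside every folder named raw_data below root.

    The raw_data folders themselves are kept so the landing-zone
    layout stays intact. Returns the list of removed paths.
    """
    removed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if os.path.basename(dirpath) == "raw_data":
            for name in filenames:
                path = os.path.join(dirpath, name)
                os.remove(path)
                removed.append(path)
    return removed
```

Run it on a copy of the landing zone backup, never on the operational landing zone itself.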
Pull Backup Folder to new Server (NI)
$ sudo rsync -v -als -e ssh user@oldHost:/path/to/backup/ /path/to/backup/on/new/instance
Stop the Datapool Service (NI)
$ sudo systemctl stop datapool.service
Copy Content of Landing Zone and Landing Zone Backup to rightful Place (NI)
I’m assuming the Datapool Landing Zone is in the directory /nfsmount.
$ sudo mv /path/to/backup/on/new/instance/dlz_todaysDate/* /nfsmount/landing_zone/
$ sudo mv /path/to/backup/on/new/instance/landing_zone_backup/* /nfsmount/landing_zone_backup/
Remove Datapool Database (DD) and Create a new empty DD (NI)
$ PGPASSWORD=YOURPASSWORD psql -U datapool -h 127.0.0.1 postgres
postgres=> drop database datapool;
postgres=> # press Ctrl+D to exit the postgres prompt
$ sudo -u postgres createdb -O datapool datapool
Restore Database (NI)
$ sudo psql -U datapool -h localhost datapool < /path/to/backup/on/new/instance/backup_todaysDate.sql
Start Datapool Service (NI)
$ sudo systemctl start datapool.service
Check Datapool Service (NI)
$ systemctl status datapool.service
Check Database Integrity
The data migration is now finished. This last step provides a small snippet to check the migrated data. The Python script runIntegrityCheck.py checks whether all tables have been migrated and whether all columns of each table are present. Additionally, all entries of each table are cross-checked against the old instance (except for the signal table). The signal table is only checked via signal_id, which verifies that all signals have been copied.
The checks are performed by querying the old and the new instance simultaneously and comparing the outputs. The connection details for both servers/databases must be adapted; the results are only printed, nothing is written.
The queries to both signal tables retrieve monthly chunks. If you run out of RAM, scale the retrieved chunk size down to weekly or daily.
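Scaling the chunk size down amounts to generating smaller time windows for the signal-table queries. The stdlib sketch below (function name invented, a stand-in for the pandas date_range used in runIntegrityCheck.py) shows how the window length maps to memory use: more, smaller windows mean smaller result sets per query.

```python
from datetime import datetime, timedelta


def chunk_bounds(start, end, days):
    """Split [start, end] into consecutive (from, to) windows of `days` days.

    Pass days=7 for weekly or days=1 for daily chunks when the monthly
    windows used by runIntegrityCheck.py exhaust your RAM.
    """
    bounds = []
    lo = start
    step = timedelta(days=days)
    while lo < end:
        hi = min(lo + step, end)
        bounds.append((lo, hi))
        lo = hi
    return bounds
```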
$ python runIntegrityCheck.py
runIntegrityCheck.py
import psycopg2
import pandas as pd


def query(qq, host, port, database, user, password):
    with psycopg2.connect(host=host, port=port, database=database,
                          user=user, password=password) as conn:
        with conn.cursor() as cur:
            cur.execute(qq)
            fetch = cur.fetchall()
            desc = cur.description
    return fetch, desc


def queryOld(qq):
    host = "YOUR-OLD-HOSTS-IP"
    port = "YOUR-OLD-HOSTS-PORT"
    database = "datapool"
    user = "datapool"
    password = "YOUR-OLD-HOSTS-DATABASE-PASSWORD"
    return query(qq, host, port, database, user, password)


def queryNew(qq):
    host = "YOUR-NEW-HOSTS-IP"
    port = "YOUR-NEW-HOSTS-PORT"
    database = "datapool"
    user = "datapool"
    password = "YOUR-NEW-HOSTS-DATABASE-PASSWORD"
    return query(qq, host, port, database, user, password)


### ---
# Check that the same tables exist
checkTables = ("SELECT table_name FROM information_schema.tables "
               "WHERE table_schema='public';")
ot = queryOld(checkTables)[0]
nt = queryNew(checkTables)[0]
onlyInOld = set(ot).difference(set(nt))
if onlyInOld:
    print("Tables - in Old and not in New:\n{}".format(onlyInOld))
TABLES = [i[0] for i in ot]
### ---

### ---
# Check that the same tables have the same columns
for table in TABLES:
    checkColumns = """SELECT * FROM information_schema.columns
                      WHERE table_schema = 'public'
                      AND table_name = '{}';""".format(table)
    oc = queryOld(checkColumns)[0]
    nc = queryNew(checkColumns)[0]
    oCols = [c[3] for c in oc]
    nCols = [c[3] for c in nc]
    onlyInOld = set(oCols).difference(set(nCols))
    if onlyInOld:
        print("Columns - in Old and not in New:\n{}".format(onlyInOld))
### ---

### ---
# Check that table contents are identical
# (the signal table is sliced into monthly queries due to its size)
timeDataInDataBaseStart = "20130101000000"
timeDataInDataBaseEnd = "20190701000000"
timeRange = pd.date_range(timeDataInDataBaseStart, timeDataInDataBaseEnd,
                          freq="1M")  # "1M" -> 1 month
timeRange = [dt.strftime("%Y-%m-%d %X") for dt in timeRange]
timeRange = [(timeRange[i], timeRange[i + 1]) for i in range(len(timeRange) - 1)]

for table in TABLES:
    if table != "signal":
        checkContent = """SELECT * FROM {};""".format(table)
        oc = queryOld(checkContent)[0]
        nc = queryNew(checkContent)[0]
        onlyInOld = set(oc).difference(set(nc))
        if onlyInOld:
            print("Content - in Old and not in New:\n{}".format(onlyInOld))
    else:
        for start, end in timeRange:
            checkContent = """SELECT signal_id FROM signal
                              WHERE '{}'::timestamp <= signal.timestamp
                              AND signal.timestamp <= '{}'::timestamp
                              ORDER BY signal_id ASC;""".format(start, end)
            oc = queryOld(checkContent)[0]
            nc = queryNew(checkContent)[0]
            onlyInOld = set(oc).difference(set(nc))
            if onlyInOld:
                print("Content - in Old and not in New {}:\n{}".format(
                    (start, end), onlyInOld))
### ---