Obsolete documentation ====================== This material should be worked in to the other sections. Run server in test mode ----------------------- The following sequence initializes ``datapool`` and runs the server in single process mode. .. code-block:: bash $ rm -rf ./lz 2>/dev/null $ export ETC=./etc $ rm -rf $ETC 2>/dev/null $ pool init-config --use-sqlitedb ./lz $ pool init-db $ pool check-config $ pool run-simple-server Usually ``pool init-config`` would write to ``/etc/datapool`` and thus the command requires ``root`` privileges. Setting the environment variable ``ETC`` allows overriding the ``/etc`` folder so we do not interfere with a global setup. Further we use ``--use-sqlitedb`` so configuration and setup of a data base system as Postgres is not required. This flag is introduced for testing, in operational mode we recommond to avoid this flag and configer Postgres instead. The last ``run-simple-server`` command will observe changes to the operational landing zone at `./lz` and report its operations. The command does not run in the background and thus will block the terminal until the user presses ``CTRL-C`` to enforce shutdown. As a data provider we open another terminal window, setup a development landing zone and commit the defaults to the operational landing zone. You should then see some output from the ``run-simple-server`` command in the previous terminal window: .. code-block:: bash $ rm -rf ./dlz 2>/dev/null $ export ETC=./etc $ pool start-develop dlz $ pool check dlz $ pool update-operational dlz Workflow example ---------------- To initialize ``datapool`` configuration on the current server run the ``init-config`` subcommand, this might require admin permissions because the config file is stored in the ``/etc/datapool`` folder: .. code-block:: bash $ pool init-config ./lz > init-config - guess settings - 'matlab' not found on $PATH - created config files at /etc/datapool please edit these files and adapt the data base configuration to your setup + initialized landing zone at ./lz Then edit this file and run ``pool check-config``: .. code-block:: bash $ pool check-config > check-config - check settings in config file /etc/datapool/datapool.ini - try to connect to db - could not connect to db postgresql://user:password@localhost:5432/datapool - check R configuration + code execution - matlab not configured, skip tests - check julia configuration + code execution - check julia version. - check python configuration + code execution + all checks passed To start development create a so called *development landing zone** which can be an arbitrary folder: .. code-block:: bash $ pool start-develop ./dlz > start-develop - setup development landing zone - operational landing zone is empty. create development landing zone with example files. + setup done This copied some example ``.yaml`` files, conversion scripts and raw data files. To check the scripts run: .. code-block:: bash $ pool check-scripts ./dlz > check-scripts - check landing zone at ./dlz - check ./dlz/data/sensor_from_company_xyz/sensor_instance_julia/conversion.jl - wrote conversion result to /tmp/tmp9hcxslxv/sensor_instance_julia_0.csv - wrote conversion result to /tmp/tmp9hcxslxv/sensor_instance_julia_0.txt - check ./dlz/data/sensor_from_company_xyz/sensor_instance_python/conversion.py - wrote conversion result to /tmp/tmp9hcxslxv/sensor_instance_python_0.csv - wrote conversion result to /tmp/tmp9hcxslxv/sensor_instance_python_0.txt - check ./dlz/data/sensor_from_company_xyz/sensor_instance_r/conversion.r - wrote conversion result to /tmp/tmp9hcxslxv/sensor_instance_r_0.csv - wrote conversion result to /tmp/tmp9hcxslxv/sensor_instance_r_0.txt + congratulations: checks succeeded. This checked the scripts and you can inspect the results files as displayed in the output. To check the ``.yaml`` files: .. code-block:: bash $ pool check-yamls ./dlz/ > check-yamls - check yamls in landing zone at ./dlz/ - setup fresh development db. productive does not exist or is empty. - load and check 1 new yaml files: - ./dlz/data/parameters.yaml + all yaml files checked Now you can upload the changes from the development landing zone to the operational landing zone: .. code-block:: bash $ pool update-operational ./dlz > update-operational - check before copying files around. - copied data/parameters.yaml - copied data/sensor_from_company_xyz/sensor_instance_julia/conversion.jl - copied data/sensor_from_company_xyz/sensor_instance_julia/raw_data/data-001.raw - copied data/sensor_from_company_xyz/sensor_instance_matlab/raw_data/data-001.raw - copied data/sensor_from_company_xyz/sensor_instance_python/conversion.py - copied data/sensor_from_company_xyz/sensor_instance_python/raw_data/data-001.raw - copied data/sensor_from_company_xyz/sensor_instance_r/conversion.r - copied data/sensor_from_company_xyz/sensor_instance_r/raw_data/data-001.raw - copied data/sensor_from_company_xyz/source_type.yaml - copied sites/example_site/images/24G35_regenwetter.jpg - copied sites/example_site/images/IMG_0312.JPG - copied sites/example_site/images/IMG_0732.JPG - copied sites/example_site/site.yaml + copied 13 files to ./lz Deprecated (DP1 to DP1): Data Migration to another Datapool Instance -------------------------------------------------------------------- The following describes the procedure to migrate data and the entire datapool setup from one server to another. Prerequisites are, that the new server already has the datapool installed and configured. It will be distinguished between old instance **OI** (where the data lies at the moment) and new instance **NI** (where the data is supposed to be migrated to). The following commands will be flagged so that it is easier to follow on which server to run which command. 1. Backup current Database (**OI**) .. code-block:: none $ sudo pg_dump -U datapool -h localhost datapool > /path/to/backup/backup_todaysDate.sql 2. Create Development Landing Zone and move it to the Backup Folder (**OI**) .. code-block:: none $ cd $ pool start-develop dlz_todaysDate $ mv ~/dlz_todaysDate /path/to/backup/ 3. Copy Landing Zone Backup (**OI**) To save storage volume on the new system you might want to delete the files contained in all ``raw_data`` folders. I'm assuming the `Datapool Landing Zone` is in the directory `/nfsmount`. .. code-block:: none $ cp -r /nfsmount/landing_zone_backup /path/to/backup/ 4. Pull Backup Folder to new Server (**NI**) .. code-block:: none $ sudo rsync -v -als -e ssh user@oldHost:/path/to/backup/ /path/to/backup/on/new/instance 5. Stop the Datapool Service (**NI**) .. code-block:: none $ sudo systemctl stop datapool.service 6. Copy Content of Landing Zone and Landing Zone Backup to rightful Place (**NI**) I'm assuming the `Datapool Landing Zone` is in the directory `/nfsmount`. .. code-block:: none $ sudo mv /path/to/backup/on/new/instance/dlz_todaysDate/* /nfsmount/landing_zone/ $ sudo mv /path/to/backup/on/new/instance/landing_zone_backup/* /nfsmount/landing_zone_backup/ 7. Remove Datapool Database (DD) and Create a new empty DD (**NI**) .. code-block:: none $ PGPASSWORD=YOURPASSWORD psql -U datapool -h 127.0.0.1 postgres postgres=> drop database datapool; postgres=> ^+D #(strg+D -> to exit postgres prompt) $ sudo -u postgres createdb -O datapool datapool 8. Restore Database (**NI**) .. code-block:: none $ sudo psql -U datapool -h localhost datapool < /path/to/backup/on/new/instance/backup_todaysDate.sql 9. Start Datapool Service (**NI**) .. code-block:: none sudo systemctl start datapool.service 10. Check Datapool Service (**NI**) .. code-block:: none systemctl status datapool.service 11. Check Database Integrity The data migration is finished. This last step contains a little snipped to check the migrated data. This python script **runIntegrityCheck.py** **checks** whether **all tables** have been migrated and whether **all columns** of each table are present. Additionally **all entries of each table** are cross checked with the old instance (except the signal-table). The **signal table** is only **checked via the signal_id**, thereby monitoring whether all signals have been copied. The checks are being performed via querying the old and the new instance simultaneously and comparing the outputs. Connection details to both servers/databases must be adapted and the output will be printed only! The queries to both **signal table** are retrieving monthly chunks. If you run out of RAM scale down the retrieved chunk size to weekly or daily. .. code-block:: none $ python runIntegrityCheck.py **runIntegrityCheck.py** .. code-block:: python import psycopg2 import pandas as pd def query(qq,host,port,database, user, password): with psycopg2.connect(host=host, port=port, database=database, user=user, password=password) as conn: with conn.cursor() as cur: cur.execute(qq) fetch = cur.fetchall() desc = cur.description return fetch, desc def queryOld(qq): host = "YOUR-OLD-HOSTS-IP" port = "YOUR-OLD-HOSTS-PORT" database = "datapool" user = "datapool" password = "YOUR-OLD-HOSTS-DATABASE-PASSWORD" return query(qq,host,port,database, user, password) def queryNew(qq): host = "YOUR-NEW-HOSTS-IP" port = "YOUR-NEW-HOSTS-PORT" database = "datapool" user = "datapool" password = "YOUR-NEW-HOSTS-DATABASE-PASSWORD" return query(qq,host,port,database, user, password) ### --- # Checking if same Tables exist checkTables = "SELECT table_name FROM information_schema.tables WHERE table_schema='public';" ot = queryOld(checkTables)[0] nt = queryNew(checkTables)[0] onlyInOld = set(ot).difference(set(nt)) if onlyInOld != set(): print("Tables - in Old and not in New:\n{}".format(onlyInOld)) TABLES = [i[0] for i in ot] ### --- ### --- # Checking if same Tables have same columns for table in TABLES: checkColumns = """SELECT * FROM information_schema.columns WHERE table_schema = 'public' AND table_name = '{}' ;""".format(table) oc = queryOld(checkColumns)[0] nc = queryNew(checkColumns)[0] oCols = [c[3] for c in oc] nCols = [c[3] for c in nc] onlyInOld = set(oCols).difference(set(nCols)) if onlyInOld != set(): print("Columns - in Old and not in New:\n{}".format(onlyInOld)) ### --- ### --- # Checking if Tables are identical (slicing queries monthly due to large amount of data) timeDataInDataBaseStart = "20130101000000" timeDataInDataBaseEnd = "20190701000000" timeRange = pd.date_range(timeDataInDataBaseStart, timeDataInDataBaseEnd, freq="1m") # "1m" -> 1 Month timeRange = [dt.strftime("%Y-%m-%d %X") for dt in timeRange] timeRange = [(timeRange[i], timeRange[i+1]) for i in range(len(timeRange)-1)] for table in TABLES: if table !="signal": checkContent = """SELECT * FROM {};""".format(table) oc = queryOld(checkContent)[0] nc = queryNew(checkContent)[0] onlyInOld = set(oc).difference(set(nc)) if onlyInOld != set(): print("Columns - in Old and not in New:\n{}".format(onlyInOld)) else: for start, end in timeRange: checkContent = """SELECT signal_id FROM signal WHERE '{}'::timestamp <= signal.timestamp AND signal.timestamp <= '{}'::timestamp ORDER BY signal_id ASC;""".format(start,end) oc = queryOld(checkContent)[0] nc = queryNew(checkContent)[0] onlyInOld = set(oc).difference(set(nc)) if onlyInOld != set(): print("Content - in Old and not in New {}:\n{}".format((start,end),onlyInOld)) ### ---