Obsolete documentation

This material should be worked into the other sections.

Run server in test mode

The following sequence initializes the datapool and runs the server in single-process mode.

$ rm -rf ./lz 2>/dev/null

$ export ETC=./etc
$ rm -rf $ETC 2>/dev/null

$ pool init-config --use-sqlitedb ./lz
$ pool init-db
$ pool check-config
$ pool run-simple-server

Usually pool init-config writes to /etc/datapool and thus requires root privileges. Setting the environment variable ETC overrides the /etc folder so that we do not interfere with a global setup; with ETC=./etc the configuration is written to ./etc/datapool instead.

Further, we use --use-sqlitedb so that configuring and setting up a database system such as Postgres is not required. This flag is intended for testing; in operational mode we recommend avoiding it and configuring Postgres instead.

The final run-simple-server command observes changes to the operational landing zone at ./lz and reports its operations. The command does not run in the background and thus blocks the terminal until the user presses CTRL-C to shut it down.

Acting as a data provider, we open another terminal window, set up a development landing zone and commit the defaults to the operational landing zone. You should then see some output from the run-simple-server command in the previous terminal window:

$ rm -rf ./dlz 2>/dev/null
$ export ETC=./etc

$ pool start-develop dlz
$ pool check dlz
$ pool update-operational dlz

Workflow example

To initialize the datapool configuration on the current server, run the init-config subcommand. This might require admin permissions because the config files are stored in the /etc/datapool folder:

$ pool init-config ./lz

> init-config
- guess settings
  - 'matlab' not found on $PATH
- created config files at /etc/datapool
  please edit these files and adapt the data base configuration to your setup
+ initialized landing zone at ./lz

Then edit these files and run pool check-config:

$ pool check-config

> check-config
- check settings in config file /etc/datapool/datapool.ini
- try to connect to db
- could not connect to db postgresql://user:password@localhost:5432/datapool
- check R configuration + code execution
- matlab not configured, skip tests
- check julia configuration + code execution
- check julia version.
- check python configuration + code execution
+ all checks passed
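
If check-config reports that it cannot connect to the database, the connection string shown in the output can be probed directly with psycopg2 (the same library used by the integrity check script further below). This is only a quick manual test; host, user and password are the placeholders from the example output and must be replaced with your own settings:

    import psycopg2

    # Connection URI as printed by check-config; adapt it to your datapool.ini.
    conn = psycopg2.connect("postgresql://user:password@localhost:5432/datapool")
    with conn.cursor() as cur:
        cur.execute("SELECT version();")
        print(cur.fetchone()[0])
    conn.close()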

To start development, create a so-called development landing zone, which can be an arbitrary folder:

$ pool start-develop ./dlz

> start-develop
- setup development landing zone
- operational landing zone is empty. create development landing zone with example files.
+ setup done

This copied some example .yaml files, conversion scripts and raw data files. To check the scripts, run:

$ pool check-scripts ./dlz

> check-scripts
- check landing zone at ./dlz
- check ./dlz/data/sensor_from_company_xyz/sensor_instance_julia/conversion.jl
- wrote conversion result to /tmp/tmp9hcxslxv/sensor_instance_julia_0.csv
- wrote conversion result to /tmp/tmp9hcxslxv/sensor_instance_julia_0.txt
- check ./dlz/data/sensor_from_company_xyz/sensor_instance_python/conversion.py
- wrote conversion result to /tmp/tmp9hcxslxv/sensor_instance_python_0.csv
- wrote conversion result to /tmp/tmp9hcxslxv/sensor_instance_python_0.txt
- check ./dlz/data/sensor_from_company_xyz/sensor_instance_r/conversion.r
- wrote conversion result to /tmp/tmp9hcxslxv/sensor_instance_r_0.csv
- wrote conversion result to /tmp/tmp9hcxslxv/sensor_instance_r_0.txt
+ congratulations: checks succeeded.

This checked the scripts; you can inspect the result files at the paths displayed in the output.
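
The result files can be loaded for a closer look, for example with pandas. Note that the temporary directory (/tmp/tmp9hcxslxv in the output above) is created anew on every run, so the path must be taken from your own check-scripts output:

    import pandas as pd

    # Path as printed by check-scripts; the temporary folder name differs per run.
    df = pd.read_csv("/tmp/tmp9hcxslxv/sensor_instance_python_0.csv")
    print(df.head())
    print(df.dtypes)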

To check the .yaml files:

$ pool check-yamls ./dlz/

> check-yamls
- check yamls in landing zone at ./dlz/
- setup fresh development db. productive does not exist or is empty.
- load and check 1 new yaml files:
- ./dlz/data/parameters.yaml
+ all yaml files checked
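
If a yaml check fails it can help to inspect the file manually. A minimal sketch using PyYAML (this is just an inspection aid, not part of the pool command):

    import yaml

    # Print the top-level structure of a landing zone yaml file.
    with open("./dlz/data/parameters.yaml") as fh:
        content = yaml.safe_load(fh)
    print(content)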

Now you can upload the changes from the development landing zone to the operational landing zone:

$ pool update-operational ./dlz

> update-operational
- check before copying files around.
- copied data/parameters.yaml
- copied data/sensor_from_company_xyz/sensor_instance_julia/conversion.jl
- copied data/sensor_from_company_xyz/sensor_instance_julia/raw_data/data-001.raw
- copied data/sensor_from_company_xyz/sensor_instance_matlab/raw_data/data-001.raw
- copied data/sensor_from_company_xyz/sensor_instance_python/conversion.py
- copied data/sensor_from_company_xyz/sensor_instance_python/raw_data/data-001.raw
- copied data/sensor_from_company_xyz/sensor_instance_r/conversion.r
- copied data/sensor_from_company_xyz/sensor_instance_r/raw_data/data-001.raw
- copied data/sensor_from_company_xyz/source_type.yaml
- copied sites/example_site/images/24G35_regenwetter.jpg
- copied sites/example_site/images/IMG_0312.JPG
- copied sites/example_site/images/IMG_0732.JPG
- copied sites/example_site/site.yaml
+ copied 13 files to ./lz

Deprecated (DP1 to DP1): Data Migration to another Datapool Instance

The following describes the procedure for migrating the data and the entire datapool setup from one server to another. The prerequisite is that the new server already has the datapool installed and configured. We distinguish between the old instance, OI (where the data currently resides), and the new instance, NI (to which the data is to be migrated). Each command below is flagged so that it is clear on which server to run it.

  1. Backup current Database (OI)

    $ sudo pg_dump -U datapool -h localhost datapool > /path/to/backup/backup_todaysDate.sql
    
  2. Create Development Landing Zone and move it to the Backup Folder (OI)

    $ cd
    $ pool start-develop dlz_todaysDate
    $ mv ~/dlz_todaysDate /path/to/backup/
    
  3. Copy Landing Zone Backup (OI)

    To save storage space on the new system you might want to delete the files contained in all raw_data folders; a sketch for this follows after the copy command below. This step assumes the datapool landing zone is in the directory /nfsmount.

    $ cp -r /nfsmount/landing_zone_backup /path/to/backup/
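
    A minimal sketch for stripping the raw_data files from the backup copy (assuming the folder layout shown by update-operational above, with raw_data directories below data/; double check the path so that only the backup copy is touched):

    from pathlib import Path

    # Delete raw data files inside the backup copy only, never in the live landing zone.
    backup = Path("/path/to/backup/landing_zone_backup")
    for raw_file in backup.glob("**/raw_data/*"):
        if raw_file.is_file():
            raw_file.unlink()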
    
  4. Pull Backup Folder to new Server (NI)

    $ sudo rsync -v -als -e ssh user@oldHost:/path/to/backup/ /path/to/backup/on/new/instance
    
  5. Stop the Datapool Service (NI)

    $ sudo systemctl stop datapool.service
    
  6. Copy Content of Landing Zone and Landing Zone Backup to rightful Place (NI)

    This step assumes the datapool landing zone is in the directory /nfsmount.

    $ sudo mv /path/to/backup/on/new/instance/dlz_todaysDate/* /nfsmount/landing_zone/
    $ sudo mv /path/to/backup/on/new/instance/landing_zone_backup/* /nfsmount/landing_zone_backup/
    
  7. Remove Datapool Database (DD) and Create a new empty DD (NI)

    $ PGPASSWORD=YOURPASSWORD psql -U datapool -h 127.0.0.1 postgres
    postgres=> drop database datapool;
    postgres=> \q    # or press Ctrl+D to exit the psql prompt
    $ sudo -u postgres createdb -O datapool datapool
    
  8. Restore Database (NI)

    $ sudo psql -U datapool -h localhost datapool < /path/to/backup/on/new/instance/backup_todaysDate.sql
    
  9. Start Datapool Service (NI)

    $ sudo systemctl start datapool.service
    
  10. Check Datapool Service (NI)

    $ systemctl status datapool.service
    
  11. Check Database Integrity

    The data migration is now finished. This last step contains a small snippet to check the migrated data. The python script runIntegrityCheck.py checks whether all tables have been migrated and whether all columns of each table are present. Additionally, all entries of each table are cross-checked against the old instance, except for the signal table, which is only compared via signal_id; this verifies that all signals have been copied.

    The checks are performed by querying the old and the new instance and comparing the outputs. The connection details for both servers/databases must be adapted in the script, and any differences are only printed to the terminal.

    The queries to the signal tables retrieve monthly chunks. If you run out of RAM, scale the chunk size down to weekly or daily.

    $ python runIntegrityCheck.py
    

    runIntegrityCheck.py

    import psycopg2
    import pandas as pd
    
    def query(qq,host,port,database, user, password):
    
        with psycopg2.connect(host=host, port=port, database=database, user=user, password=password) as conn:
            with conn.cursor() as cur:
                cur.execute(qq)
                fetch = cur.fetchall()
                desc = cur.description
    
        return fetch, desc
    
    
    def queryOld(qq):
    
        host = "YOUR-OLD-HOSTS-IP"
        port = "YOUR-OLD-HOSTS-PORT"
        database = "datapool"
        user = "datapool"
        password = "YOUR-OLD-HOSTS-DATABASE-PASSWORD"
    
        return query(qq,host,port,database, user, password)
    
    def queryNew(qq):
    
        host = "YOUR-NEW-HOSTS-IP"
        port = "YOUR-NEW-HOSTS-PORT"
        database = "datapool"
        user = "datapool"
        password = "YOUR-NEW-HOSTS-DATABASE-PASSWORD"
    
        return query(qq,host,port,database, user, password)
    
    
    ### ---
    # Checking if same Tables exist
    checkTables = "SELECT table_name FROM information_schema.tables WHERE table_schema='public';"
    
    ot = queryOld(checkTables)[0]
    nt = queryNew(checkTables)[0]
    onlyInOld = set(ot).difference(set(nt))
    
    if onlyInOld != set():
        print("Tables - in Old and not in New:\n{}".format(onlyInOld))
    
    TABLES = [i[0] for i in ot]
    ### ---
    
    ### ---
    # Checking if same Tables have same columns
    for table in TABLES:
        checkColumns = """SELECT *
                        FROM information_schema.columns
                        WHERE table_schema = 'public'
                        AND table_name   = '{}'
                            ;""".format(table)
        oc = queryOld(checkColumns)[0]
        nc = queryNew(checkColumns)[0]
    
        oCols = [c[3] for c in oc]
        nCols = [c[3] for c in nc]
    
        onlyInOld = set(oCols).difference(set(nCols))
        if onlyInOld != set():
            print("Columns - in Old and not in New:\n{}".format(onlyInOld))
    ### ---
    
    ### ---
    # Checking if Tables are identical (slicing queries monthly due to large amount of data)
    timeDataInDataBaseStart = "20130101000000"
    timeDataInDataBaseEnd = "20190701000000"
    timeRange = pd.date_range(timeDataInDataBaseStart, timeDataInDataBaseEnd, freq="1M") # monthly chunks; use "1W" or "1D" if RAM is tight
    timeRange = [dt.strftime("%Y-%m-%d %X") for dt in timeRange]
    timeRange = [(timeRange[i], timeRange[i+1]) for i in range(len(timeRange)-1)]
    
    for table in TABLES:
        if table !="signal":
            checkContent = """SELECT * FROM {};""".format(table)
    
            oc = queryOld(checkContent)[0]
            nc = queryNew(checkContent)[0]
            onlyInOld = set(oc).difference(set(nc))
    
            if onlyInOld != set():
                print("Columns - in Old and not in New:\n{}".format(onlyInOld))
    
        else:
            for start, end in timeRange:
                checkContent = """SELECT signal_id FROM signal WHERE '{}'::timestamp <= signal.timestamp AND signal.timestamp <= '{}'::timestamp ORDER BY signal_id ASC;""".format(start,end)
                oc = queryOld(checkContent)[0]
                nc = queryNew(checkContent)[0]
                onlyInOld = set(oc).difference(set(nc))
    
                if onlyInOld != set():
                    print("Content - in Old and not in New {}:\n{}".format((start,end),onlyInOld))
    ### ---