Instructions for data users

All data is store in a PostgrSQL database so that arbitrary queries can be performed. The figure below shows the database layout.

_images/DataModel.svg

Connect to data pool

To connect with the database you need the following information from your admin:

  • The host, e.g. “a.server.com”

  • The port of the database, e.g. 5432

  • The name of the database, e.g. “datapool”

  • A database user name, e.g “data_user”

  • The database password

You can connect directly to the database via psql. However, it is more convenient to load the data required directly in the environment used for further analysis:

R

With the RPostgreSQL package data can be loaded directly into R.

Python

Different options exist, psycopg2 is widly used. The datapool_client package uses psycopg2 and offers some tailored high level functionality.

Example SQL queries

The SQL language may look cumbersome at first. However, it gives a lot of flexibility and allows to express even very complex queries. This SQL tutorial is a helpful reference.

Also note that the DatapoolR package provides functions to simplify common queries.

List all data sources:

List all data sources:

SELECT * FROM source;

List all sites and check how many images a site has:

SELECT
    name,
    coord_x,
    coord_y,
    coord_z,
    street,
    postcode,
    COUNT(picture.filename),
    site.description
FROM site
    LEFT JOIN picture ON site.site_id=picture.site_id
GROUP BY site.site_id;

Get all signals measured at site site_A within a given time interval t_start, t_end:

SELECT
    signal.timestamp,
    value,
    unit,
    parameter.name,
    source_type.name,
    source.name
FROM signal
    INNER JOIN site ON signal.site_id = site.site_id
    INNER JOIN parameter ON signal.parameter_id = parameter.parameter_id
    INNER JOIN source ON signal.source_id = source.source_id
    INNER JOIN source_type ON source.source_type_id = source_type.source_type_id
WHERE site.name = 'site_A'
    AND signal.timestamp >= 't_start'::timestamp
    AND signal.timestamp <= 't_end'::timestamp;

Get all signals of a certain source and parameter source_A, parameter_A within a given time interval t_start, t_end:

WITH source_ids AS (
    SELECT source_id::integer FROM source WHERE source.name = 'source_A'
), parameter_ids AS (
    SELECT parameter_id::integer FROM parameter WHERE parameter.name = 'parameter_A'
)
SELECT
    signal.timestamp,
    value,
    parameter.unit,
    parameter.name,
    source.name,
    source.serial,
    source_type.name,
    site.name
FROM signal
    INNER JOIN site ON signal.site_id = site.site_id
    INNER JOIN parameter ON signal.parameter_id = parameter.parameter_id
    INNER JOIN source ON signal.source_id = source.source_id
    INNER JOIN source_type ON source.source_type_id = source_type.source_type_id
WHERE signal.source_id = ANY(ARRAY(SELECT source_id::integer FROM source_ids))
    AND signal.parameter_id = ANY(ARRAY(SELECT parameter_id::integer FROM parameter_ids))
    AND 't_start'::timestamp <= signal.timestamp
    AND signal.timestamp <= 't_end'::timestamp
ORDER BY signal.timestamp ASC;

Retrieve the parameters that are recorded for a certain source_type all signals measured at site source_type_A:

WITH parameter_ids AS (
    WITH source_ids AS (
        WITH source_type_ids AS (
            SELECT source_type_id::integer FROM source_type WHERE source_type.name = 'source_type_A'
        )
        SELECT source_id::integer FROM source WHERE source_type_id =
        ANY(ARRAY(SELECT source_type_id::integer FROM source_type_ids))
    )
    SELECT DISTINCT parameter_id::integer FROM signal WHERE source_id =
    ANY(ARRAY(SELECT source_id::integer FROM source_ids))
)
SELECT parameter.name FROM parameter
WHERE parameter.parameter_id = ANY(ARRAY(SELECT parameter_id::integer FROM parameter_ids));

Retrieve all sites that belong to a certain source source_A:

WITH site_ids AS (
    WITH source_ids AS (
        SELECT source_id::integer FROM source WHERE source.name = 'source_A'
    )
    SELECT DISTINCT site_id::integer FROM signal WHERE source_id =
    ANY(ARRAY(SELECT source_id::integer FROM source_ids))
)
SELECT site.name FROM site
WHERE site.site_id = ANY(ARRAY(SELECT site_id::integer FROM site_ids));

Count the entries for all data in the database and group by source, and parameter, aggregated weekly:

WITH count_table AS (
    SELECT
        count(parameter_id),
        parameter_id,
        source_id,
        date_trunc('week', timestamp)
    FROM signal
    GROUP BY
        date_trunc('week', timestamp),
        parameter_id,
        source_id
)
SELECT
    count_table.count AS value_count,
    parameter.name AS parameter_name,
    source.name AS source_name,
    count_table.date_trunc AS date_trunc
FROM count_table
INNER JOIN parameter ON parameter.parameter_id = count_table.parameter_id
INNER JOIN source ON source.source_id = count_table.source_id
order by date_trunc desc;

Get all source that have data of a parameter X parameter_X:

WITH parameter_ids as (
    SELECT parameter_id::integer FROM parameter WHERE parameter.name = 'parameter_X'
), source_ids as (
    SELECT DISTINCT source_id::integer FROM signal
    WHERE signal.parameter_id IN (
        SELECT parameter_id::integer FROM parameter_ids
    )
)
SELECT source.name from source
WHERE source.source_id IN (
    SELECT source_id::integer from source_ids
);

Get a datetime range for all data stored for a certain source_X:

WITH source_ids as (
    SELECT source_id::integer FROM source WHERE source.name = 'source_X'
), signal_group as (
    SELECT distinct signal.timestamp::timestamp FROM signal
    WHERE signal.source_id IN (
        SELECT source_id::integer FROM source_ids
    )
)
SELECT DISTINCT MIN(signal_group.timestamp), MAX(signal_group.timestamp) from signal_group
ORDER BY MIN(signal_group.timestamp), MAX(signal_group.timestamp);

Get the special value definition for a source_type X source_type_X

SELECT
    special_value_definition.numerical_value
FROM source_type
    LEFT JOIN special_value_definition ON source_type.source_type_id = special_value_definition.source_type_id
WHERE source_type.name='source_type_X'
GROUP BY special_value_definition.numerical_value;

Get the source_type for a source X source_X

SELECT
    source_type.name
FROM source
LEFT JOIN source_type ON source.source_type_id = source_type.source_type_id
WHERE source.name='source_X';

Get the n number_of_signals newest signals:

SELECT
    timestamp,
    value,
    unit,
    parameter_name,
    source_name,
    source_serial,
    source_type_name
FROM (
    SELECT
        signal.signal_id   AS signal_id,
        signal.timestamp,
        value,
        unit,
        parameter.NAME     AS parameter_name,
        source.NAME        AS source_name,
        source.serial      AS source_serial,
        source_type.NAME   AS source_type_name
    FROM signal
        INNER JOIN site ON signal.site_id = site.site_id
        INNER JOIN parameter ON signal.parameter_id = parameter.parameter_id
        INNER JOIN source ON signal.source_id = source.source_id
        INNER JOIN source_type ON source.source_type_id = source_type.source_type_id
    WHERE timestamp >= (Now() - interval '1 week')
    ORDER BY signal.timestamp DESC limit 'number_of_signals'
) as signals

GROUP BY
    signals.timestamp,
    signals.value,
    signals.signal_id,
    signals.parameter_name,
    signals.unit,
    signals.source_type_name,
    signals.source_name,
    signals.source_serial;

Get the latest signals for each source.

SELECT
    source.name,
    MAX(signal.timestamp)
FROM signal
    INNER JOIN source ON signal.source_id = source.source_id
GROUP BY source.name
ORDER BY MAX(signal.timestamp) ASC;