Instructions for data users
All data is store in a PostgrSQL database so that arbitrary queries can be performed. The figure below shows the database layout.
Connect to data pool
To connect with the database you need the following information from your admin:
The host, e.g. “a.server.com”
The port of the database, e.g. 5432
The name of the database, e.g. “datapool”
A database user name, e.g “data_user”
The database password
You can connect directly to the database via psql. However, it is more convenient to load the data required directly in the environment used for further analysis:
R
With the RPostgreSQL package data can be loaded directly into R.
Python
Different options exist, psycopg2 is widly used. The datapool_client package uses psycopg2 and offers some tailored high level functionality.
Example SQL queries
The SQL language may look cumbersome at first. However, it gives a lot of flexibility and allows to express even very complex queries. This SQL tutorial is a helpful reference.
Also note that the DatapoolR package provides functions to simplify common queries.
List all data sources:
List all data sources:
SELECT * FROM source;
List all sites and check how many images a site has:
SELECT
name,
coord_x,
coord_y,
coord_z,
street,
postcode,
COUNT(picture.filename),
site.description
FROM site
LEFT JOIN picture ON site.site_id=picture.site_id
GROUP BY site.site_id;
Get all signals measured at site site_A
within a given time interval t_start, t_end
:
SELECT
signal.timestamp,
value,
unit,
parameter.name,
source_type.name,
source.name
FROM signal
INNER JOIN site ON signal.site_id = site.site_id
INNER JOIN parameter ON signal.parameter_id = parameter.parameter_id
INNER JOIN source ON signal.source_id = source.source_id
INNER JOIN source_type ON source.source_type_id = source_type.source_type_id
WHERE site.name = 'site_A'
AND signal.timestamp >= 't_start'::timestamp
AND signal.timestamp <= 't_end'::timestamp;
Get all signals of a certain source and parameter source_A
, parameter_A
within a given time interval t_start, t_end
:
WITH source_ids AS (
SELECT source_id::integer FROM source WHERE source.name = 'source_A'
), parameter_ids AS (
SELECT parameter_id::integer FROM parameter WHERE parameter.name = 'parameter_A'
)
SELECT
signal.timestamp,
value,
parameter.unit,
parameter.name,
source.name,
source.serial,
source_type.name,
site.name
FROM signal
INNER JOIN site ON signal.site_id = site.site_id
INNER JOIN parameter ON signal.parameter_id = parameter.parameter_id
INNER JOIN source ON signal.source_id = source.source_id
INNER JOIN source_type ON source.source_type_id = source_type.source_type_id
WHERE signal.source_id = ANY(ARRAY(SELECT source_id::integer FROM source_ids))
AND signal.parameter_id = ANY(ARRAY(SELECT parameter_id::integer FROM parameter_ids))
AND 't_start'::timestamp <= signal.timestamp
AND signal.timestamp <= 't_end'::timestamp
ORDER BY signal.timestamp ASC;
Retrieve the parameters that are recorded for a certain source_type all signals measured at site source_type_A
:
WITH parameter_ids AS (
WITH source_ids AS (
WITH source_type_ids AS (
SELECT source_type_id::integer FROM source_type WHERE source_type.name = 'source_type_A'
)
SELECT source_id::integer FROM source WHERE source_type_id =
ANY(ARRAY(SELECT source_type_id::integer FROM source_type_ids))
)
SELECT DISTINCT parameter_id::integer FROM signal WHERE source_id =
ANY(ARRAY(SELECT source_id::integer FROM source_ids))
)
SELECT parameter.name FROM parameter
WHERE parameter.parameter_id = ANY(ARRAY(SELECT parameter_id::integer FROM parameter_ids));
Retrieve all sites that belong to a certain source source_A
:
WITH site_ids AS (
WITH source_ids AS (
SELECT source_id::integer FROM source WHERE source.name = 'source_A'
)
SELECT DISTINCT site_id::integer FROM signal WHERE source_id =
ANY(ARRAY(SELECT source_id::integer FROM source_ids))
)
SELECT site.name FROM site
WHERE site.site_id = ANY(ARRAY(SELECT site_id::integer FROM site_ids));
Count the entries for all data in the database and group by source, and parameter, aggregated weekly:
WITH count_table AS (
SELECT
count(parameter_id),
parameter_id,
source_id,
date_trunc('week', timestamp)
FROM signal
GROUP BY
date_trunc('week', timestamp),
parameter_id,
source_id
)
SELECT
count_table.count AS value_count,
parameter.name AS parameter_name,
source.name AS source_name,
count_table.date_trunc AS date_trunc
FROM count_table
INNER JOIN parameter ON parameter.parameter_id = count_table.parameter_id
INNER JOIN source ON source.source_id = count_table.source_id
order by date_trunc desc;
Get all source that have data of a parameter X parameter_X
:
WITH parameter_ids as (
SELECT parameter_id::integer FROM parameter WHERE parameter.name = 'parameter_X'
), source_ids as (
SELECT DISTINCT source_id::integer FROM signal
WHERE signal.parameter_id IN (
SELECT parameter_id::integer FROM parameter_ids
)
)
SELECT source.name from source
WHERE source.source_id IN (
SELECT source_id::integer from source_ids
);
Get a datetime range for all data stored for a certain source_X
:
WITH source_ids as (
SELECT source_id::integer FROM source WHERE source.name = 'source_X'
), signal_group as (
SELECT distinct signal.timestamp::timestamp FROM signal
WHERE signal.source_id IN (
SELECT source_id::integer FROM source_ids
)
)
SELECT DISTINCT MIN(signal_group.timestamp), MAX(signal_group.timestamp) from signal_group
ORDER BY MIN(signal_group.timestamp), MAX(signal_group.timestamp);
Get the special value definition for a source_type X source_type_X
SELECT
special_value_definition.numerical_value
FROM source_type
LEFT JOIN special_value_definition ON source_type.source_type_id = special_value_definition.source_type_id
WHERE source_type.name='source_type_X'
GROUP BY special_value_definition.numerical_value;
Get the source_type for a source X source_X
SELECT
source_type.name
FROM source
LEFT JOIN source_type ON source.source_type_id = source_type.source_type_id
WHERE source.name='source_X';
Get the n number_of_signals
newest signals:
SELECT
timestamp,
value,
unit,
parameter_name,
source_name,
source_serial,
source_type_name
FROM (
SELECT
signal.signal_id AS signal_id,
signal.timestamp,
value,
unit,
parameter.NAME AS parameter_name,
source.NAME AS source_name,
source.serial AS source_serial,
source_type.NAME AS source_type_name
FROM signal
INNER JOIN site ON signal.site_id = site.site_id
INNER JOIN parameter ON signal.parameter_id = parameter.parameter_id
INNER JOIN source ON signal.source_id = source.source_id
INNER JOIN source_type ON source.source_type_id = source_type.source_type_id
WHERE timestamp >= (Now() - interval '1 week')
ORDER BY signal.timestamp DESC limit 'number_of_signals'
) as signals
GROUP BY
signals.timestamp,
signals.value,
signals.signal_id,
signals.parameter_name,
signals.unit,
signals.source_type_name,
signals.source_name,
signals.source_serial;
Get the latest signals for each source.
SELECT
source.name,
MAX(signal.timestamp)
FROM signal
INNER JOIN source ON signal.source_id = source.source_id
GROUP BY source.name
ORDER BY MAX(signal.timestamp) ASC;