Instructions for data users

First steps with the Datapool

The datapool is a package that uses a PostgreSQL database to store sensor data and provides a set of functions to add, modify, and delete data sources.

But, what is a database? Find more information here.

PostgreSQL is a powerful, open-source, object-relational database.

To access the data stored in the database we use SQL (Structured Query Language). It is worth becoming familiar with basic querying in order to access the data. This SQL tutorial is a helpful reference.

The figure below shows the current database layout.

[Figure: database layout (DataModel.svg)]

Database schema download.

You can find a detailed explanation of the tables included in the datapool at the following link: Database layout.

Connect to the Datapool

To connect to the database you need the following information from your admin:

  • The host, e.g. “a.server.com”

  • The port of the database, e.g. 5432

  • The name of the database, e.g. “datapool”

  • A database user name, e.g. “data_user”

  • The database user password

You can connect directly to the database via psql. However, it is usually more convenient to load the required data directly into the environment used for further analysis.
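As a sketch, the connection parameters listed above can be combined into a libpq connection URI, which psql (and most PostgreSQL clients) accepts directly. The values below are the example placeholders from the list above, not real credentials:

```python
# Build a libpq connection URI from the example parameters above
# (all values are placeholders -- ask your admin for the real ones).
host = "a.server.com"
port = 5432
database = "datapool"
user = "data_user"

uri = f"postgresql://{user}@{host}:{port}/{database}"
print(uri)  # postgresql://data_user@a.server.com:5432/datapool
```

You can then connect with psql "postgresql://data_user@a.server.com:5432/datapool"; psql will prompt for the password.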

Python

We recommend using the datapool_client package, which builds on psycopg2 and offers tailored high-level functionality.

Datapool Client Package usage

In a clean directory with a virtual environment active, install the datapool_client package following the instructions in the repository.
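For reference, creating a clean directory and activating a virtual environment looks like this on a POSIX shell (on Windows, the activate script lives under .venv\Scripts\ instead):

```shell
# create a clean project directory with a fresh virtual environment
mkdir -p datapool_project && cd datapool_project
python3 -m venv .venv
. .venv/bin/activate
# now install datapool_client following the repository instructions
```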

To set up the configuration to connect to the Datapool, we need to run the following code:

from datapool_client import set_defaults

# connection parameters (example)
instance = "DEFAULT"
host = "ip.to.datapool.host"
port = "5432"
database = "db_name"
user = "db_user_name"
password = "db_user_password"

# this function sets the default parameters programmatically
set_defaults(
  instance=instance,
  host=host,
  port=port,
  database=database,
  user=user,
  password=password
)

After running this code, we run the following lines to make sure we can access the Datapool.

from datapool_client import DataPool
dp = DataPool()

If the connection is successful, we receive the following message:

You are successfully connected to the database!

Once we have set up the default connection, we don’t need to run the set_defaults() function again.

The only exception is if we would like to set up a second connection to another datapool. In that case, we first run set_defaults() with instance = "My_instance_name" and then connect to the Datapool with dp = DataPool(instance="My_instance_name").

Example SQL queries using datapool_client package

If we want to run a specific query and get back a dataframe:

data_dataframe = dp.query_df(
    "select * from site;"
)

In this case we have queried all the information in the site table. The output is a pandas dataframe.

source = "your_source_name"
start = "2018-07-11 00:00:00"
end = "2018-07-11 23:55:00"
variable = "your_variable_name"

# retrieve all signals of one source and variable within the time window
data = dp.signal.get(
    source_name=source,
    variable_name=variable,
    start=start,
    end=end
)

This query returns all the signals from the source "your_source_name" between the start and end dates, for the given variable.
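Since the result is a pandas dataframe, it can be post-processed directly. A sketch with mock data standing in for the database call (the timestamp and value columns mirror the signal table; the exact columns returned may differ):

```python
import pandas as pd

# Mock of a dataframe for one source/variable; in real use this
# would come from dp.signal.get(...) instead.
data = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2018-07-11 00:00", "2018-07-11 00:30", "2018-07-11 01:00"]),
    "value": [1.0, 3.0, 5.0],
})

# Resample to hourly means, a common first step for sensor signals
hourly = data.set_index("timestamp")["value"].resample("1h").mean()
print(hourly.tolist())  # [2.0, 5.0]
```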

We explore a broader usage of SQL queries to the datapool in the section Example SQL queries.

R

You can use the RPostgreSQL package to connect to the datapool from R.
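A minimal sketch of such a connection with RPostgreSQL, reusing the example placeholder parameters from above (replace them with your real values):

```r
library(RPostgreSQL)

# connect using the example placeholder parameters
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv,
                 host = "a.server.com",
                 port = 5432,
                 dbname = "datapool",
                 user = "data_user",
                 password = "db_user_password")

# run a query and fetch the result as a data.frame
site <- dbGetQuery(con, "SELECT * FROM site;")
dbDisconnect(con)
```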

Example SQL queries

The SQL language may look cumbersome at first. However, it gives a lot of flexibility and allows you to express even very complex queries. We will explore a few queries, starting from simple ones and moving to more complex ones, to give an idea of its functionality.

List all the columns (*) from the source table:

SELECT * FROM source;

List all sites with their descriptions and check how many images each site has. A couple of remarks on this query:

  1. We select the name and the description from the site table.

  2. Writing FROM site s creates the alias s for the table site.

  3. We LEFT JOIN the picture table to the site table, allowing for sites without any image.

  4. COUNT(p.picture_id) combined with GROUP BY s.site_id counts all pictures that share the same s.site_id.

SELECT
      s."name",
      s.description,
      COUNT(p.picture_id)
FROM site s
LEFT JOIN picture p ON p.site_id = s.site_id
GROUP BY s.site_id;

Get all signals measured at site site_A within a given time interval t_start, t_end. The correct timestamp format is '2017-07-19 18:15:00.000'. In this case, we join the signal table with the site, variable, source and source_type tables and filter the result using the WHERE clause.

SELECT
    s.timestamp,
    s.value,
    v.unit,
    v."name" AS "Variable Name",
    s3."name" AS "Source Name",
    st."name" AS "Source Type Name"
FROM signal s
    JOIN site s2 ON s2.site_id = s.site_id
    JOIN variable v ON v.variable_id = s.variable_id
    JOIN "source" s3 ON s3.source_id = s.source_id
    JOIN source_type st ON st.source_type_id = s3.source_type_id
WHERE
    s2.name = 'site_A'
    AND s."timestamp" >= 't_start'
    AND s."timestamp" <= 't_end';

Get all signals of a certain source and parameter source_A, parameter_A within a given time interval t_start, t_end. In this case we use the ORDER BY clause to retrieve the data ordered by ascending timestamp.

SELECT
    s."timestamp",
    s.value,
    v.unit,
    v."name",
    s2.name AS "Source Name",
    s2.serial,
    st."name" AS "Source-type Name",
    s3."name" AS "Site Name"
FROM signal s
  JOIN variable v ON v.variable_id = s.variable_id
  JOIN "source" s2 ON s2.source_id = s.source_id
  JOIN source_type st ON st.source_type_id = s2.source_type_id
  JOIN site s3 ON s3.site_id = s.site_id
WHERE s2.name = 'source_A'
  AND v."name" = 'parameter_A'
  AND 't_start'::timestamp <= s.timestamp
  AND s.timestamp <= 't_end'::timestamp
ORDER BY s.timestamp ASC;

One thing to keep in mind is that in SQL, the order of the clauses is important:

  • SELECT

  • FROM/JOIN

  • WHERE

  • GROUP BY

  • HAVING

  • ORDER BY

  • LIMIT/OFFSET
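To illustrate, here is a complete query using all of these clauses in the required order. The sketch runs against an in-memory SQLite database standing in for PostgreSQL, with a toy signal-like table:

```python
import sqlite3

# In-memory stand-in for the datapool (SQLite, not PostgreSQL)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE signal (variable TEXT, value REAL)")
con.executemany(
    "INSERT INTO signal VALUES (?, ?)",
    [("temp", 1.0), ("temp", 2.0), ("flow", 3.0), ("flow", -1.0)],
)

# all clauses, in the order SQL requires them
rows = con.execute("""
    SELECT variable, COUNT(*) AS n   -- SELECT
    FROM signal                      -- FROM
    WHERE value > 0                  -- WHERE
    GROUP BY variable                -- GROUP BY
    HAVING COUNT(*) >= 1             -- HAVING
    ORDER BY n DESC                  -- ORDER BY
    LIMIT 10                         -- LIMIT
""").fetchall()
print(rows)  # [('temp', 2), ('flow', 1)]
```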

Retrieve the variables that were recorded for a certain source_type source_type_A within a given time interval. DISTINCT is used to display just one line per variable.

SELECT
    DISTINCT v."name"
FROM variable v
  JOIN signal s ON s.variable_id = v.variable_id
  JOIN "source" s2 ON s2.source_id = s.source_id
  JOIN source_type st on st.source_type_id = s2.source_type_id
WHERE st.name = 'source_type_A'
  AND '2020-07-19 18:15:00.000'::timestamp <= s.timestamp
  AND s.timestamp <= '2020-07-20 18:15:00.000'::timestamp;

Retrieve all sites that belong to a certain source source_A:

SELECT
    s."name"
FROM site s
  JOIN meta_data md ON md.site_id = s.site_id
  JOIN "source" s2 on s2.source_id = md.source_id
WHERE s2.name = 'source_A';

Count the entries for all data in the database, grouped by source and variable, aggregated weekly:

WITH count_table AS (
    SELECT
        count(variable_id),
        variable_id,
        source_id,
        date_trunc('week', timestamp)
    FROM signal
    GROUP BY
        date_trunc('week', timestamp),
        variable_id,
        source_id
)
SELECT
    count_table.count AS value_count,
    v.name AS variable_name,
    source.name AS source_name,
    count_table.date_trunc AS date_trunc
FROM count_table
INNER JOIN variable v on v.variable_id  = count_table.variable_id
INNER JOIN source ON source.source_id = count_table.source_id
ORDER BY date_trunc desc;

Get all sources that have data for variable <my_variable>:

SELECT s."name"
FROM "source" s
  JOIN signal s2 ON s.source_id = s2.source_id
  JOIN variable v ON v.variable_id = s2.variable_id
WHERE v."name" = '<my_variable>'
GROUP BY s.name;

Get the datetime range of all data stored for a certain source <my_source>:

SELECT
  min(s."timestamp"),
  max(s."timestamp")
FROM signal s
JOIN "source" s2 ON s2.source_id = s.source_id
WHERE s2.name = '<my_source>';

Get the special value definitions for a certain source_type <my_source_type>:

SELECT
    svd.numerical_value
FROM source_type st
    LEFT JOIN special_value_definition svd ON st.source_type_id = svd.source_type_id
WHERE st.name='<my_source_type>'
GROUP BY svd.numerical_value;

Get the source_type of a source <my_source>:

SELECT
    source_type.name
FROM source
LEFT JOIN source_type ON source.source_type_id = source_type.source_type_id
WHERE source.name='<my_source>';

Get the <number_of_signals> newest signals:

WITH signals AS (
    SELECT
        s.timestamp,
        s.value,
        v.unit,
        v.name AS variable_name,
        s2.NAME        AS source_name,
        s2.serial      AS source_serial,
        st.NAME   AS source_type_name
    FROM signal s
        INNER JOIN site s3 ON s.site_id = s3.site_id
        INNER JOIN variable v  ON s.variable_id = v.variable_id
        INNER JOIN source s2 ON s.source_id = s2.source_id
        INNER JOIN source_type st ON s2.source_type_id = st.source_type_id

    WHERE s.timestamp >= (NOW() - interval '1 month')
    ORDER BY s.timestamp DESC
    LIMIT <number_of_signals>
)

SELECT
    s.timestamp,
    s.value,
    s.unit,
    s.variable_name,
    s.source_name,
    s.source_serial,
    s.source_type_name
FROM signals s
GROUP BY
    s.timestamp,
    s.value,
    s.variable_name,
    s.unit,
    s.source_type_name,
    s.source_name,
    s.source_serial;

Get the latest signal timestamp for each source:

SELECT
  swc."Source Name",
  swc.laststamp
FROM source_with_coordinates swc;