Instructions for data provider¶
The data provider is usually the scientist performing measurements in the field.
System Architecture¶
[Figure: data flow model specification]
Note
The data provider must make sure that
- the raw data arrive in the landing-zone at the right place,
- a conversion script exists to standardize the raw data, and
- meta-data are provided.
General workflow¶
Working on the real landing-zone would be dangerous. Therefore, all
development and testing is done on a copy of the landing-zone. The
datapool provides a command to create development landing-zones. A
development landing-zone can have any name, but let's call it dlz for now:
$ pool start-develop dlz
This creates a folder (a copy of the real landing-zone) called dlz
in the home directory. You can see what the created landing-zone
looks like with ls -l dlz.
The datapool provides various checks to ensure that the provided conversion scripts and meta-data are consistent. The checks are run with:
$ pool check dlz
If everything is fine, modify the development landing-zone (e.g. add a new sensor) according to the instructions given below. After the modifications, run the checks again:
$ pool check dlz
It is recommended to execute these checks after every small change.
When you are done with the changes and the command pool check
is successful, you can update the operational landing zone:
$ pool update-operational dlz
In the following sections, the different types of operations (modifications, additions, etc.) are explained.
Add Raw Data to Existing Source¶
Raw data files are written to the respective data/ folders in the
operational landing zone as follows:

1. A new file, for example data.inwrite, is created and data are written to this file.
2. Once the file content is complete and the corresponding file handle is closed, the file is renamed to data-TIMESTAMP.raw.
Note
Files must start with data and end with .raw!
The actual format of TIMESTAMP is not fixed, but it must be a unique
string that starts with a dash - and can be temporally ordered. Encoding
a full date and time string will help users and developers to inspect
and find files, especially if present in the backup zone.
This procedure is called the write-rename pattern and prevents the conversion of incomplete data files. The risk of such a race condition depends on the size of the incoming data files and other factors, and is probably very low. However, running a data pool over a long time span increases this risk and could result in missing data in the database.
This process is usually automated.
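For illustration, a minimal Python sketch of the write-rename pattern (the directory layout and the exact timestamp format are assumptions of this sketch; any unique, dash-prefixed, temporally ordered suffix works):

import os
import time

def write_raw_file(rows, data_dir):
    # example timestamp: unique and sortable by time
    timestamp = time.strftime("%Y-%m-%dT%H-%M-%S")
    tmp_path = os.path.join(data_dir, "data.inwrite")
    final_path = os.path.join(data_dir, "data-%s.raw" % timestamp)

    # write everything to the temporary file first
    with open(tmp_path, "w") as f:
        for row in rows:
            f.write(row + "\n")

    # the file handle is closed at this point; only now is the file
    # renamed, so the datapool never picks up a half-written raw file
    os.rename(tmp_path, final_path)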
Add Site¶
The addition of a new site can be initiated with the command:
$ pool add site dlz
The name, acting as the key, has to be correctly set. The rest of the required information can be added to the site.yaml
file later on.
The information that needs to be included in the yaml file is the following:
- Name: name of the site
- Description: free text describing the particularities of the site
- Street, City and Coordinates (CH1903+/LV95): specify where the site is located
- Pictures (optional): pictures relating to the site can be specified; they are normally stored in the images folder of the specific site.
The structure of the file site.yaml, which is stored within the sites folder of the landing zone, has to be the same as in the example below:
name: wildbach_kempttalstr
description: The site is the river Wildbach flowing underneath the Kempttalstrasse.
street: Kempttalstrasse
postcode: 8320
city: Fehraltorf
coordinates:
  x: 2698752
  y: 1249699
  z: 526
pictures:
  - path: images/1.JPG
    description: the bridge with the Kempttalstrasse going over the small river Wildbach
  - path: images/2.JPG
    description: the zone beneath the bridge where the pressure sensor is installed
  - path: images/3.JPG
    description: the installed pressure sensor is within the tube
country: Switzerland
To test the changes run:
$ pool check dlz
If the obligatory check succeeds, update the operational landing zone:
$ pool update-operational dlz
Add or Modify Parameters¶
The addition of a new parameter can be initiated with the command:
$ pool add parameter dlz
The file parameters.yaml is stored in the data folder and contains all the parameters. The information to be included is:

- Name: name of the parameter
- Unit: specifies the unit of the parameter
- Description: additional description of the parameter. If no description is required, the field can remain empty.
The syntax has to match the following example (note the dash in the first line):
- name: parameter_1
  description: whatever you want to write
  unit: xyz
- name: parameter_2
  description: very interesting description
  unit: xyz
- name: parameter_3
  description: very detailed stuff
  unit: xyz
Add a new Source Type¶
The addition of a new source type can be initiated with the command:
$ pool add source_type dlz
The name, acting as the key, has to be correctly set. The rest of the required information can be added to the source_type.yaml
file later on.
name: source_category_xyz
description: all sources in this category belong to the xyz group
manufacturer: XYZ Group
special_values:
  - description: not a number
    categorical_value: NA
    numerical_value: -6666.0
Again, test the modifications and update the operational landing zone if the tests are successful.
$ pool check dlz
$ pool update-operational dlz
Add and Modify Projects¶
The addition of a new project can be initiated with the command:
$ pool add project dlz
A name and description need to be provided.
- title: project_1
  description: What a great project!!
- title: project_2
  description: And another great project!!
Again, test the modifications and update the operational landing zone if the tests are successful.
$ pool check dlz
$ pool update-operational dlz
Add Source Instance¶
The addition of a new source instance can be initiated with the command:
$ pool add source dlz
The name, acting as the key, has to be correctly set. The rest of the required information can be added to the source.yaml
file later on.
name: bt_dl958_166_luppmenweg
description: Replacement logger for bt_dl922_166_luppmenweg and bt_dl947_166_luppmenweg
serial: 958
project: project_1
Again, test the modifications and update the operational landing zone if the tests are successful.
$ pool check dlz
$ pool update-operational dlz
Conversion of raw data¶
The files arriving in the landing zone are called raw data. Every raw data file must be converted into a so-called standardized file by a conversion script. The format of the standardized files is defined below. Typically, every source instance needs an individually adapted conversion script.
Additionally, every conversion script has a yaml file in which the number of header lines and the block size for the conversion can be specified:
header_lines: 1
block_size: 50000
Since version 0.6 it is also possible to use conversion scripts source-type-wide. This is especially useful if the data of multiple sources are collected in a single file.
Standardized file format¶
The standardized file format for the input data is a csv file with
either six or four columns. It must adhere to the following standards:

- File format: csv file, semicolon (;) delimited.
- Date format: yyyy-mm-dd hh:mm:ss
- Column names: The first row contains the column names. The first three are always timestamp, parameter, value. Next, either the three columns x, y, z or a single column site must be given. The parameter must exist in the parameters.yaml and have exactly the same name (see above). For the source-type-based option, another column named source must be added. The referred source must exist for the source type.
- value column: must contain only numerical values. Missing values (NULL, NA, or similar) are not allowed.
- The z-coordinate column may be empty.
Example standardized file format with coordinates¶
timestamp           | parameter             | value  | x       | y       | z
--------------------|-----------------------|--------|---------|---------|----
2013-11-13 10:06:00 | Water Level           | 148.02 | 2682558 | 1239404 |
2013-11-13 10:08:00 | Water Level           | 146.28 | 2682558 | 1239404 |
2013-11-13 10:10:00 | Average Flow Velocity | 0.64   | 2682558 | 1239404 | 412
…                   | …                     | …      | …       | …       | …
Example standardized file format with site¶
timestamp           | parameter             | value  | site
--------------------|-----------------------|--------|-------
2013-11-13 10:06:00 | Water Level           | 148.02 | zurich
2013-11-13 10:08:00 | Water Level           | 146.28 | zurich
2013-11-13 10:10:00 | Average Flow Velocity | 0.64   | zurich
…                   | …                     | …      | …
Example standardized file format with site and source¶
timestamp           | parameter             | value  | site   | source
--------------------|-----------------------|--------|--------|--------------
2013-11-13 10:06:00 | Water Level           | 148.02 | zurich | OM_Ultrasonic
2013-11-13 10:08:00 | Ammonia               | 34.76  | zurich | E&H_Ammonia
2013-11-13 10:10:00 | Turbidity             | 52.7   | zurich | Trio_Spectro
…                   | …                     | …      | …      | …
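Before running pool check, a provider may want to verify a standardized file against the rules above. The following Python sketch is a hypothetical helper (not part of the datapool) that checks the delimiter, the column names, the timestamp format, and the numeric value column:

import csv
from datetime import datetime

VALID_HEADERS = [
    ["timestamp", "parameter", "value", "x", "y", "z"],
    ["timestamp", "parameter", "value", "site"],
    ["timestamp", "parameter", "value", "site", "source"],
]

def check_standardized_file(path):
    # raises ValueError on the first rule violation found
    with open(path) as f:
        reader = csv.reader(f, delimiter=";")
        header = next(reader)
        if header not in VALID_HEADERS:
            raise ValueError("unexpected column names: %s" % header)
        for n, row in enumerate(reader, start=2):
            # timestamps must match yyyy-mm-dd hh:mm:ss
            datetime.strptime(row[0], "%Y-%m-%d %H:%M:%S")
            # the value column must be purely numerical (no NULL/NA)
            try:
                float(row[2])
            except ValueError:
                raise ValueError("line %d: non-numeric value %r" % (n, row[2]))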
Conversion script¶
The conversion script must define a function which reads raw data and writes an output file (a standardized file). The first argument of this function is the path to the input raw data; the second argument is the path to the resulting file.
The following points should be considered when writing a conversion script:

- Indicate corrupt input data by throwing an exception within the conversion script. An informative error message is helpful and will be logged.
- If a conversion script writes to stdout (i.e. normal print() commands), this output may not appear in the datapool log file and thus might be overlooked.
- All required third-party modules, packages, or libraries must be installed globally. Do not try to install them within a script.
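As an illustration of this interface, here is a minimal Python skeleton (the tab-separated input and the emptiness check are assumptions of this sketch, not datapool requirements):

import pandas

def convert(raw_file, output_file):
    # read the raw data; the tab separator is an assumption of this sketch
    raw = pandas.read_csv(raw_file, sep="\t")

    # corrupt input: raise an exception with an informative message,
    # which the datapool will log
    if raw.empty:
        raise ValueError("no data rows found in %s" % raw_file)

    # ... reshape `raw` into the standardized six- or four-column format ...

    # write the standardized file, semicolon delimited
    raw.to_csv(output_file, sep=";", index=False)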
The following code snippets show what a conversion script could look like in different languages.
R¶
- The file must be named conversion.r.
- The function must be named convert.
# Example conversion script for R
# Aug 04 2017 -- Simon Dicht
# EDIT Jan 08 2018 -- Lukas Keller

convert <- function(raw_file, output_file) {
    data.raw <- read.delim(raw_file, sep=";", skip=1, header=F, dec=".", stringsAsFactors=F)
    # remove the apostrophe so 1'000 can be read as a number
    data.raw[,5] <- gsub("'", "", data.raw[,5])
    # set special values for OL and NA
    data.raw[,4][is.na(data.raw[,4])] <- -9999
    data.raw[,5][is.na(data.raw[,5])] <- -9999
    data.raw[data.raw == "OL"] <- -8888
    # define location ID
    pos <- "11e_russikerstr"
    time <- as.POSIXct(data.raw[,1], format="%d.%m.%Y %H:%M:%S", tz="UTC")
    time <- format(time, "%Y-%m-%d %H:%M:%S")
    # if ncol(data.raw == 8)...
    # create the data frame with all fields (long format)
    data <- data.frame(time, "flow rate", data.raw[,4], pos)
    data <- rbind(data, setNames(data.frame(time, "average flow velocity", data.raw[,5], pos), names(data)))
    data <- rbind(data, setNames(data.frame(time, "water level", data.raw[,6], pos), names(data)))
    data <- rbind(data, setNames(data.frame(time, "Water temperature", data.raw[,7], pos), names(data)))
    data <- rbind(data, setNames(data.frame(time, "battery voltage", data.raw[,8], pos), names(data)))
    data <- rbind(data, setNames(data.frame(time, "battery charge", data.raw[,9], pos), names(data)))
    data <- rbind(data, setNames(data.frame(time, "internal temperature", data.raw[,11], pos), names(data)))
    data <- rbind(data, setNames(data.frame(time, "RSSI", data.raw[,10], pos), names(data)))
    # delete all rows with NA
    # data <- na.omit(data)
    # name the columns according to the standardized file format
    colnames(data) <- c("timestamp", "parameter", "value", "site")
    # write.csv2(data, file="Test", row.names=FALSE)
    write.table(data, file=output_file, row.names=FALSE, col.names=TRUE,
                quote=FALSE, sep=";")
}
Necessary source-specific changes include:

- the read.delim call: adapt to the provided format
- special values
- position
- time
- offline/online loggers
- the data matrix
Julia¶
- The function must be named convert.
- The name of the Julia file and the declared module must be the same (up to the .jl file extension), so the file containing the module conversion_lake_zurich must be saved as conversion_lake_zurich.jl.
- Furthermore, the module and file name must be unique within the landing zone.
# Example Julia conversion script
# September 27, 2016 -- Alex Hunziker

module conversion_FloDar_Fehraltorf_2

# ---> 1.) load required package (optional)
using DataFrames

function convert(raw_file, output_file)

    # ---> 2.) read file
    if !isfile(raw_file)
        error("Error: raw_file does not exist.")
    end

    # the header line contains non-utf8 encoded characters, so we skip it:
    dataraw = DataFrame(readtable(raw_file, separator='\t', skipstart=1, header=false))
    names!(dataraw, map(symbol, ["Date Time", "Water Level", "Average Flow Velocity", "Flow",
                                 "Temperature", "Surface Flow Velocity", "Distance",
                                 "Distance Reading Count", "Surcharge Level",
                                 "Peak to Mean Ratio", "Number of Samples", "Battery Voltage"]))

    # ---> 3.) test properties
    if size(dataraw, 2) != 12
        error("Input file has wrong number of columns.")
    end

    # ---> 4.) add additional information (optional)
    # define coordinates
    xcoor = 2682558
    ycoor = 1239404
    zcoor = ""

    # ---> 5.) reformat data
    selCol = symbol("Date Time")
    time = Dates.DateTime(dataraw[selCol], "dd.mm.yyyy HH:MM")
    dataraw[selCol] = Dates.format(time, "yyyy-mm-dd HH:MM")
    dataForm = stack(dataraw, [2:12], selCol)
    dataForm = dataForm[:, [selCol, :variable, :value]]
    dataForm[4] = xcoor
    dataForm[5] = ycoor
    dataForm[6] = zcoor
    names!(dataForm, [:timestamp, :parameter, :value, :x, :y, :z])
    deleterows!(dataForm, find(isna(dataForm[:, symbol("value")])))

    # ---> 6.) write file
    writetable(output_file, dataForm, separator=';')
end

end
Python¶
# Example Python conversion script
# September 27, 2016 -- Alex Hunziker

# ---> 1.) load required packages (optional)
import os.path
import pandas


def convert(raw_file, output_file):

    # ---> 2.) read file
    if not os.path.isfile(raw_file):
        raise ValueError('Error: Input file does not exist.')

    raw_data = pandas.read_csv(raw_file, sep='\t', encoding="latin-1")
    colNames = ("Date Time", "Water Level", "Average Flow Velocity", "Flow", "Temperature",
                "Surface Flow Velocity", "Distance", "Distance Reading Count",
                "Surcharge Level", "Peak to Mean Ratio", "Number of Samples",
                "Battery Voltage")

    # ---> 3.) test properties
    if len(raw_data.columns) != 12:
        raise ValueError('Error: Input file has wrong number of columns.')

    raw_data.columns = colNames

    # ---> 4.) add additional information (optional)
    # define coordinates
    xcoor = 2682558
    ycoor = 1239404
    zcoor = ""

    # ---> 5.) reformat data
    time = pandas.to_datetime(raw_data['Date Time'], format="%d.%m.%Y %H:%M")
    raw_data['Date Time'] = time.apply(lambda x: x.strftime('%Y-%m-%d %H:%M'))
    data = pandas.melt(raw_data, id_vars=['Date Time'],
                       value_vars=list(raw_data.columns[1:12]))
    # rename the columns to match the standardized file format
    data.columns = ['timestamp', 'parameter', 'value']
    data = data.dropna()
    data['x'] = xcoor
    data['y'] = ycoor
    data['z'] = zcoor

    # ---> 6.) write file
    data.to_csv(output_file, sep=";", index=False)
Matlab¶
- The function must be named convert.
- The file must be named convert.m.
%
% SWW-DWH: Example MatLab conversion script
%
% 19/12/2016 - Frank Blumensaat
% Example: convert('raw_data\data-001.raw', 'out.dat');
% -------------------------------------------------------
function convert(fNameIn, fNameOut)

% read full content of the file into 'dataRaw'
fid = fopen(fullfile(fNameIn), 'r');
dataRaw = textscan(fid, '%s %f %f %f %f %f %f %f %f %f %f %f', Inf, 'Delimiter', '\t', ...
                   'TreatAsEmpty', {'NA'}, 'HeaderLines', 1);
fclose(fid);

% possible to include a check for 12 columns and numeric values in col2 - col12
fid = fopen(fullfile(fNameIn), 'r');
names = textscan(fid, '%s %s %s %s %s %s %s %s %s %s %s %s', 1, 'Delimiter', '\t', 'HeaderLines', 0);
fclose(fid);

% parse the time stamp strings into MATLAB datenums
datTime = datenum(dataRaw{1,1}(:), 'dd.mm.yyyy HH:MM');

% define coordinates
xcoor = ones(length(dataRaw{1}), 1) .* 2682558;
ycoor = ones(length(dataRaw{1}), 1) .* 1239404;
zcoor = zeros(length(dataRaw{1}), 1);

% split the data matrix according to parameter and remove NaNs
for j = 2:size(dataRaw, 2)
    dataSplit(j-1).var = excise([datTime dataRaw{1,j} xcoor ycoor zcoor]);
end

% some parameter names are not conforming to parameters.yaml:
parametersRaw = {'Level', 'Velocity', 'Surface Velocity', 'PMR', 'NOS', 'Power Supply'};
parametersUniform = {'Water Level', 'Average Flow Velocity', 'Surface Flow Velocity', ...
                     'Peak to Mean Ratio', 'Number of Samples', 'Battery Voltage'};
fixNames = containers.Map(parametersRaw, parametersUniform);

% write processed data to a cell array
celldata = {};
clear celldataTemp
for k = 1:length(dataSplit)
    for i = 1:length(dataSplit(k).var)
        % timestamps follow the ISO 8601 standard
        celldataTemp{i,1} = datestr(dataSplit(k).var(i,1), 'yyyy-mm-dd HH:MM:SS');
        name = char(names{k+1});
        % our parameters.yaml does not have the units in (..), so we remove them:
        name = regexprep(name, '\(.*\)', '');
        % correct some names:
        if isKey(fixNames, name)
            name = fixNames(name);
        end
        celldataTemp{i,2} = name;
        celldataTemp{i,3} = dataSplit(k).var(i,2);
        celldataTemp{i,4} = dataSplit(k).var(i,3);
        celldataTemp{i,5} = dataSplit(k).var(i,4);
        celldataTemp{i,6} = '';
    end
    celldata = vertcat(celldata, celldataTemp);
    clear celldataTemp
end

%% write selected data to the output file
fid = fopen(fullfile(fNameOut), 'w');
fprintf(fid, '%s; %s; %s; %s; %s; %s \n', 'timestamp', 'parameter', 'value', 'x', 'y', 'z');
nrows = size(celldata, 1);
for row = 1:nrows
    fprintf(fid, '%s; %s; %f; %d; %d; %d \n', celldata{row,:});
end
fclose(fid);
end

%% function to remove NaN values
function X = excise(X)
X(any(isnan(X)'), :) = [];
end