Instructions for data provider
==============================

The data provider is usually the scientist performing measurements in the field.

System Architecture
-------------------

Data flow model :download:`specification `.

.. image:: graphics/DataFlowModel.svg

.. NOTE::
   The data provider must make sure that i) the raw data arrive in the
   landing-zone at the right place, ii) a conversion script exists to
   standardize the raw data, and iii) meta-data are provided.

General workflow
----------------

Working on the real landing-zone would be dangerous. Therefore, all development
and testing is done on a copy of the landing-zone. The datapool provides a
command to create development landing-zones. A development landing-zone can
have any name, but let's call it ``dlz`` for now:

.. code-block:: none

   $ pool start-develop dlz

This creates a folder (a copy of the real landing-zone) called ``dlz`` in the
home directory. You can see what the created landing zone looks like with
``ls -l dlz``.

The datapool provides various checks to ensure that the provided conversion
scripts and meta-data are consistent. The checks are run with:

.. code-block:: none

   $ pool check dlz

If everything is fine, modify the development landing-zone (e.g. add a new
sensor) according to the instructions given below. After the modifications, run
the checks again.

.. code-block:: none

   $ pool check dlz

It is recommended to execute these checks after every small change. If they
succeed, update the operational landing zone:

.. code-block:: none

   $ pool update-operational dlz

All future raw data should be delivered directly into the operational landing
zone.

In the following sections, the different types of modifications/additions are
explained.

Add Raw Data to Existing Source
--------------------------------

Raw data files are written to the respective ``data/`` folders in the
operational landing zone as follows:

1. A new file, for example ``data.inwrite``, is created and data are written
   to this file.
2. Once the file content is complete and the corresponding file handle is
   closed, the file is renamed to ``data-TIMESTAMP.raw``.

.. NOTE::
   Files must start with ``data`` and end with ``.raw``! The actual format of
   ``TIMESTAMP`` is not fixed, but it must be a unique string that starts with
   a dash ``-`` and can be ordered chronologically. Encoding a full date and
   time string will help users and developers to inspect and find files,
   especially if they are present in the backup zone.

This procedure is called the *write-rename pattern* and avoids the conversion
of incomplete data files. The risk of such a race condition depends on the
size of the incoming data files and other factors, and is probably very low.
However, running a data pool over a longer time span increases this risk and
could result in missing data in the database.

Add Site
--------

The addition of a new site can be initiated with the command:

.. code-block:: none

   $ pool add site dlz

The name, acting as the key, has to be set correctly. The rest of the required
information can be added to the ``site.yaml`` file later on.

In order to add a new measuring site, the following information has to be
prepared:

- Name: Name of the site
- Description: Free text describing the particularities of the site
- Street, City and Coordinates (CH1903+/LV95): Specifying where the site is
  located
- Pictures (optional): Pictures relating to the site can be specified.
  Pictures are normally stored in the ``images`` folder of the specific site.

This information is stored in the file ``site.yaml`` within the sites folder
of the landing zone. Its structure has to be the same as in the example below:

.. include:: examples/site_example.rst

To test the changes run:

.. code-block:: none

   $ pool check dlz

If the obligatory check succeeds, update the operational landing zone:

.. code-block:: none

   $ pool update-operational dlz

Add or Modify Parameters
------------------------

Adding a new parameter can be initiated with the command:

.. code-block:: none

   $ pool add parameter dlz

The file ``parameters.yaml`` is stored in the ``data`` folder and contains all
the parameters. The following information must be included:

- Name: Name of the parameter
- Unit: Specifies the unit of the parameter
- Description: Additional description of the parameter. If no description is
  required, the field can remain empty.

The syntax has to match the following example (note the dash in the first
line):

.. include:: examples/parameter_example.rst

Add a new Source Type
---------------------

The addition of a new source type can be initiated with the command:

.. code-block:: none

   $ pool add source_type dlz

The name, acting as the key, has to be set correctly. The rest of the required
information can be added to the ``source_type.yaml`` file later on.

.. include:: examples/source_type_example.rst

Again, test the modifications and update the operational landing zone if the
tests are successful.

.. code-block:: none

   $ pool check dlz
   $ pool update-operational dlz

Add and Modify Projects
-----------------------

Adding a new project can be initiated with the command:

.. code-block:: none

   $ pool add project dlz

A name and description need to be provided.

.. include:: examples/project_example.rst

Again, test the modifications and update the operational landing zone if the
tests are successful.

.. code-block:: none

   $ pool check dlz
   $ pool update-operational dlz

Add Source Instance
-------------------

The addition of a new source instance can be initiated with the command:

.. code-block:: none

   $ pool add source dlz

The name, acting as the key, has to be set correctly. The rest of the required
information can be added to the ``source.yaml`` file later on.

.. include:: examples/source_example.rst

Again, test the modifications and update the operational landing zone if the
tests are successful.

.. code-block:: none

   $ pool check dlz
   $ pool update-operational dlz

Conversion of raw data
~~~~~~~~~~~~~~~~~~~~~~

The files arriving in the landing zone are called *raw data*. Every raw data
file must be converted into a so-called *standardized file* by a conversion
script. The format of the standardized files is defined below. Typically,
every source instance needs an individually adapted conversion script.

Additionally, every conversion script has a *yaml* file in which the number of
**header lines** and the **block size** for the conversion can be specified.

.. include:: examples/conversion_spec_example.rst

Since version 0.6 it is also possible to use one conversion script for a whole
source type. This is especially useful if the data of multiple sources are
collected in a single file.

Standardized file format
~~~~~~~~~~~~~~~~~~~~~~~~

The standardized file format for the input data is a ``csv`` file with either
six or four columns. It must adhere to the following standards:

- File format: semicolon-delimited (``;``) ``csv`` file
- Date format: ``yyyy-mm-dd hh:mm:ss``
- Column names: The first row contains the column names. The first three are
  always: ``timestamp``, ``parameter``, ``value``. Next, either the three
  columns ``x``, ``y``, ``z``, or a single column ``site`` must be given. The
  parameter must exist in ``parameters.yaml`` and have exactly the same name
  (see above). For the source-type-based option, another column named
  ``source`` must be added. The referenced ``source`` must exist for the
  source type.
- ``value`` column: Must contain only numerical values. Missing values
  (``NULL``, ``NA``, or similar) are not allowed.
- The ``z`` coordinate column may be empty.
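
The rules above can be illustrated with a minimal Python sketch that writes the
four-column variant (with ``site``). The function name ``write_standardized``
and the sample rows are hypothetical and not part of the datapool. The sketch
does not check whether a parameter exists in ``parameters.yaml``, and it assumes
that fields need no quoting.

.. code-block:: python

   import csv
   import io
   import math
   from datetime import datetime

   # Column layout of the site-based variant described above.
   COLUMNS = ["timestamp", "parameter", "value", "site"]
   TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # yyyy-mm-dd hh:mm:ss


   def write_standardized(rows, out):
       """Write (timestamp, parameter, value, site) tuples to the text stream ``out``.

       Raises ValueError if a value is missing or not a finite number.
       (Hypothetical helper, not a datapool function.)
       """
       writer = csv.writer(out, delimiter=";", lineterminator="\n")
       writer.writerow(COLUMNS)
       for timestamp, parameter, value, site in rows:
           if value is None:
               raise ValueError(f"missing value for {parameter!r} at {timestamp}")
           # float() raises ValueError for strings such as "NA" or "NULL".
           number = float(value)
           if not math.isfinite(number):
               raise ValueError(f"non-finite value for {parameter!r} at {timestamp}")
           writer.writerow(
               [timestamp.strftime(TIMESTAMP_FORMAT), parameter, number, site]
           )


   # Made-up sample rows, mirroring the example tables in this section.
   rows = [
       (datetime(2013, 11, 13, 10, 6, 0), "Water Level", 148.02, "zurich"),
       (datetime(2013, 11, 13, 10, 8, 0), "Water Level", 146.28, "zurich"),
   ]
   buffer = io.StringIO()
   write_standardized(rows, buffer)
   print(buffer.getvalue())

Raising an exception for missing or non-numeric values, rather than writing a
placeholder, keeps the output consistent with the rule that the ``value``
column must contain only numerical values.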

Example standardized file format with coordinates
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+---------------------+-----------------------+--------+---------+---------+-----+
| timestamp           | parameter             | value  | x       | y       | z   |
+=====================+=======================+========+=========+=========+=====+
| 2013-11-13 10:06:00 | Water Level           | 148.02 | 2682558 | 1239404 |     |
+---------------------+-----------------------+--------+---------+---------+-----+
| 2013-11-13 10:08:00 | Water Level           | 146.28 | 2682558 | 1239404 |     |
+---------------------+-----------------------+--------+---------+---------+-----+
| 2013-11-13 10:10:00 | Average Flow Velocity | 0.64   | 2682558 | 1239404 | 412 |
+---------------------+-----------------------+--------+---------+---------+-----+
| ...                 | ...                   | ...    | ...     | ...     |     |
+---------------------+-----------------------+--------+---------+---------+-----+

Example standardized file format with site
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+---------------------+-----------------------+--------+--------+
| timestamp           | parameter             | value  | site   |
+=====================+=======================+========+========+
| 2013-11-13 10:06:00 | Water Level           | 148.02 | zurich |
+---------------------+-----------------------+--------+--------+
| 2013-11-13 10:08:00 | Water Level           | 146.28 | zurich |
+---------------------+-----------------------+--------+--------+
| 2013-11-13 10:10:00 | Average Flow Velocity | 0.64   | zurich |
+---------------------+-----------------------+--------+--------+
| ...                 | ...                   | ...    | ...    |
+---------------------+-----------------------+--------+--------+

Example standardized file format with site and source
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+---------------------+-----------------------+--------+--------+-----------------+
| timestamp           | parameter             | value  | site   | source          |
+=====================+=======================+========+========+=================+
| 2013-11-13 10:06:00 | Water Level           | 148.02 | zurich | OM_Ultrasonic   |
+---------------------+-----------------------+--------+--------+-----------------+
| 2013-11-13 10:08:00 | Ammonia               | 34.76  | zurich | E&H_Ammonia     |
+---------------------+-----------------------+--------+--------+-----------------+
| 2013-11-13 10:10:00 | Turbidity             | 52.7   | zurich | Trio_Spectro    |
+---------------------+-----------------------+--------+--------+-----------------+
| ...                 | ...                   | ...    | ...    | ...             |
+---------------------+-----------------------+--------+--------+-----------------+

Conversion script
~~~~~~~~~~~~~~~~~

The conversion script must define a *function* which reads raw data and writes
an output file (a standardized file). The first argument of this function is
the path to the input raw data, the second argument is the path to the
resulting file.

The following points should be considered when writing a conversion script:

- Indicate corrupt input data by throwing an exception within the conversion
  script. An informative error message is helpful and will be logged.
- If a conversion script writes to ``stdout`` (i.e. normal ``print()``
  commands), the output may not appear in the datapool log file and thus might
  be overlooked.
- All required third-party modules, packages, or libraries must be installed
  globally. Do not try to install them within a script.

The following code snippets show what a conversion script could look like in
different languages.

R
~

- The file must be named ``conversion.r``.
- The function must be named ``convert``.

.. include:: examples/r_example.rst

Necessary source-specific changes include:

- read.delim function: adapt to the provided format
- special values
- position
- time
- offline/online loggers
- matrix

Julia
~~~~~

- The function must be named ``convert``.
- The name of the julia file and the declared module must be the same (up to
  the ``.jl`` file extension). So the file containing the module
  ``conversion_lake_zurich`` must be saved as ``conversion_lake_zurich.jl``.
- Furthermore, the module and file name must be unique within the landing zone.

.. include:: examples/julia_example.rst

Python
~~~~~~

.. include:: examples/python_example.rst

Matlab
~~~~~~

- The function must be named ``convert``.
- The file must be named ``convert.m``.

.. include:: examples/matlab_example.rst