data_flow

A complete automated pipeline for processing quality-controlled meteorological station data into gap-filled, homogenised time series and daily gridded datasets. It is the second stage of the PTI+ Clima climate data workflow, and must be run after the quality_control pipeline has been completed.

[Figure: Data flow diagram]

Purpose

The pipeline takes as input the quality-controlled (QC) station data produced by the quality_control repository and applies a chain of processing steps:

  1. Pre-processing (pp) — Crops to the analysis period, aggregates multi-variable measurements, classifies stations as candidates or auxiliaries, and computes inter-station distance and correlation matrices.
  2. Gap-filling (gf) — Estimates missing values in candidate station series using information from neighbouring stations and configurable regression methods (difference, ratio, log-ratio, quantile-mapping).
  3. Gap-filling validation (gf_valid) — Leave-one-out cross-validation of the gap-filling procedure, producing accuracy statistics and per-station diagnostic reports.
  4. Homogenisation (hg) — Detects and corrects systematic shifts (inhomogeneities) in the station series using the Standard Normal Homogeneity Test (SNHT; Alexandersson 1986), applied recursively in daily and monthly modes.
  5. Climatologies (cl) — Computes per-station daily climatological statistics (mean, standard deviation, 5th/95th percentiles, extremes) using a 31-day centred moving window.
  6. Gridding (gr) — Creates daily regular grids by universal kriging with elevation (alt) and distance-to-sea (dis) as covariates, covering mainland Spain / Balearics and the Canary Islands separately.
  7. Gridding validation (gr_valid) — Leave-one-out cross-validation of the spatial interpolation.
  8. Grid comparison (gr_comp) — Compares the produced grids against reference gridded products (available for tmax and tmin only).
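As a rough illustration of the statistic behind step 4, the sketch below computes the single-break SNHT statistic, T(k) = k * mean(z[1..k])^2 + (n - k) * mean(z[k+1..n])^2, on a standardised series and reports the most likely break position. It is a minimal base-R sketch under stated assumptions, not the repository's implementation: the function name snht_statistic is hypothetical, and the recursive application, the construction of the tested series from neighbouring stations, and the critical values used by the pipeline are not shown.

```r
# Single-break SNHT statistic on one series (hypothetical helper).
snht_statistic <- function(q) {
  n <- length(q)
  z <- (q - mean(q)) / sd(q)          # standardise the series
  k <- seq_len(n - 1)                 # candidate break positions
  cs <- cumsum(z)
  z1 <- cs[k] / k                     # mean of z before (and including) k
  z2 <- (cs[n] - cs[k]) / (n - k)     # mean of z after k
  t_k <- k * z1^2 + (n - k) * z2^2
  list(T0 = max(t_k), break_at = which.max(t_k), T = t_k)
}

# Synthetic series with a 1-unit shift after the 10th value.
q <- c(rep(10, 10), rep(11, 10))
res <- snht_statistic(q)
res$break_at   # index of the last value before the detected shift
res$T0         # test statistic, to be compared against a critical value
```

A large T0 relative to the critical value for the series length suggests a shift at break_at; a recursive scheme would then split the series there and test each segment again.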

Outputs are daily NetCDF grids and per-stage .rds R objects containing all data, metadata, and configuration options.
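The climatology step (step 5) can be pictured with the following base-R sketch, which pools, for each day of the year, all values falling within a 31-day centred window (15 days either side, wrapping around the year end) and summarises them. The function name daily_climatology, the column names, and the leap-year simplification (day 366 folded into day 365) are assumptions made for illustration; the pipeline's own handling may differ.

```r
# Per-day-of-year statistics from a 31-day centred moving window.
daily_climatology <- function(dates, values, half_window = 15) {
  doy <- as.integer(format(dates, "%j"))
  doy[doy == 366] <- 365                       # simplification for leap years
  rows <- lapply(1:365, function(d) {
    dist <- abs(doy - d)
    dist <- pmin(dist, 365 - dist)             # circular distance in days
    x <- values[dist <= half_window & !is.na(values)]
    data.frame(
      doy = d,
      mean = mean(x),
      sd = sd(x),
      p05 = quantile(x, 0.05, names = FALSE),
      p95 = quantile(x, 0.95, names = FALSE),
      min = min(x),
      max = max(x)
    )
  })
  do.call(rbind, rows)
}

# Synthetic four-year series with a smooth seasonal cycle.
dates <- seq(as.Date("2001-01-01"), as.Date("2004-12-31"), by = "day")
values <- 15 + 10 * sin(2 * pi * (as.integer(format(dates, "%j")) - 110) / 365)
clim <- daily_climatology(dates, values)
head(clim)
```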

Repository Structure

data_flow/
├── R/
│   ├── main.R               ← Entry point; orchestrates the full pipeline
│   ├── functions.R          ← All functions and R5 class definitions
│   ├── config.yml           ← Per-variable configuration options
│   ├── pp_report.Rmd        ← Pre-processing summary report template
│   ├── gf_report.Rmd        ← Gap-filling summary report template
│   ├── gf_validation.Rmd    ← Gap-filling validation report template
│   ├── hg_report.Rmd        ← Homogenisation summary report template
│   ├── gr_report.Rmd        ← Gridding summary report template
│   ├── gr_validation.Rmd    ← Gridding validation report template
│   └── gr_comparison.Rmd    ← Grid comparison report template
├── data_raw/
│   ├── grd_pen.rda          ← Support grid: mainland Spain + Balearics (UTM 30N)
│   └── grd_can.rda          ← Support grid: Canary Islands (UTM 28N)
└── data/                    ← Output directory (created at runtime)
    ├── pp/                  ← Pre-processed objects
    ├── gf/                  ← Gap-filled objects
    ├── hg/                  ← Homogenised objects
    ├── cl/                  ← Climatology objects
    └── gr/                  ← Daily grids and NetCDF files

The pipeline is driven by a single script (R/main.R) that reads config.yml and sources all functions from functions.R. Each processing stage writes an .rds file containing an R5 reference-class object, which is the input for the next stage.
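The hand-off between stages can be sketched as follows: a reference-class object bundling data, configuration, and methods is written with saveRDS() and read back with readRDS(). The class name StageResult, its fields, and the n_obs method are hypothetical and chosen only for illustration; the actual class definitions are in R/functions.R.

```r
library(methods)  # provides setRefClass()

# Hypothetical stage object: real field and method names may differ.
StageResult <- setRefClass(
  "StageResult",
  fields = list(var = "character", obs = "data.frame", config = "list"),
  methods = list(
    n_obs = function() {
      # Number of non-missing observations held by the object
      sum(!is.na(obs$value))
    }
  )
)

pp <- StageResult$new(
  var = "tmax",
  obs = data.frame(station = c("A", "A", "B"), value = c(21.5, NA, 19.0)),
  config = list(gf = list(method = "diff"))
)

# One stage writes the object; the next stage reads it back.
path <- tempfile(fileext = ".rds")
saveRDS(pp, path)
restored <- readRDS(path)
n_restored <- restored$n_obs()
```

Reference-class objects have reference semantics, so a stage that modifies the object it has read changes it in place; copy() can be used when an independent copy is needed.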

Data modes

The pipeline can operate in two modes:

Database mode (default): data are read from and written to a PostgreSQL database (aemet on dana-sc-database). Quality-controlled input data are read from the quality_control schema; intermediate results are written to the data_flow schema. This is the intended production mode.

Local mode (flag -l / --local): data are read from and written to local .rds and .RData files. In local mode, the QC input files (.RData files produced by quality_control) must be present in the path specified by config$dir$qc_dir. This mode is useful for development, offline work, and reproducing results without database access.

Prerequisites

  • R ≥ 4.2
  • R packages (installed automatically via pacman): tidyverse, argparser, config, chron, Rfast, snowfall, future, terra, sf, rnaturalearth, gstat, ncdf4, abind, stringi, RPostgres, dplyr, automap, FNN
  • Pandoc (for rendering R Markdown reports; ships with RStudio)
  • The output of the quality_control pipeline, available either in the database or locally

Optionally, set the Pandoc path in the main.R preamble to match your installation (the path below is a macOS example):

Sys.setenv(RSTUDIO_PANDOC = "/Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools")

Installation

git clone https://github.com/PTI-Clima/data_flow.git
cd data_flow

Open data_flow.Rproj in RStudio, or set the working directory manually:

setwd("path/to/data_flow")

R package dependencies are installed automatically when main.R runs (via pacman::p_load).

Usage

Run from a terminal:

Rscript R/main.R <var> [options]

Get the full help message:

Rscript R/main.R --help
usage: main.R [--] [--help] [--trial] [--local] [--no_global_rep]
              [--no_indiv_rep] [--opts OPTS] [--procs PROCS]
              [--verbosity VERBOSITY] [--ncores NCORES] var

positional arguments:
  var              variable to process: one of tmax, tmin, trange, pr, hr,
                   ssrd, ws (and others as added to config.yml)

flags:
  -h, --help       show this help message and exit
  -t, --trial      enable trial mode (subset of data for testing)
  -l, --local      enable local mode (read/write local files, no database)
  -n, --no_global_rep   do not produce global HTML report
      --no_indiv_rep    do not produce per-station HTML reports

optional arguments:
  -p, --procs      processes to run: one or more of pp, gf, gf_valid, hg,
                   cl, gr, gr_valid, gr_comp (default: all, in order)
  -v, --verbosity  verbosity level: 0 = silent, >0 = status messages (default: 1)
  -c, --ncores     number of CPU cores (0 = all available; default: 1)

Examples

Run the full pipeline for maximum temperature, using the database:

Rscript R/main.R tmax

Run only gap-filling and homogenisation for precipitation, in local mode:

Rscript R/main.R pr -l -p gf hg

Run a quick trial in local mode, with only the first 100 stations and no reports:

Rscript R/main.R tmax -t -l -n --no_indiv_rep

Run all configured variables in sequence, in local mode (shell script):

#!/bin/bash
for var in tmax tmin trange pr hr ssrd ws; do
    Rscript R/main.R "$var" -l
done

Configuration

All per-variable settings are in R/config.yml. There is a default: section with settings shared across variables; individual variable sections override these defaults. The main fields are:

| Field | Description |
| --- | --- |
| dir$qc_dir | Path to quality-control output (input to pp) |
| dir$pp_dir | Output directory for pre-processed objects |
| dir$gf_dir | Output directory for gap-filled objects |
| dir$hg_dir | Output directory for homogenised objects |
| dir$cl_dir | Output directory for climatology objects |
| dir$gr_dir | Output directory for gridded data |
| period_analysis$start/end | Analysis period (currently 1961-01-01 to 2025-10-31) |
| var$name | Internal variable identifier (e.g., "tmax") |
| var$AEMET_name | Raw AEMET variable code(s) in the QC data |
| var$scale | "relative" (e.g., temperature) or "absolute" (e.g., precipitation) |
| var$unit | Physical unit string |
| var$factor | Scale factor applied to raw AEMET values |
| var$range | Physical plausibility range (lower, upper) |
| var$outlier_tolerance | Percentage above/below climatological extremes to tolerate |
| cand$min_years$active | Minimum years required for an active station to be a candidate |
| cand$min_years$inactive | Minimum years required for an inactive station |
| cand$exclude_auto | Logical: exclude automatic stations from candidates |
| aux$min_years | Minimum years required for an auxiliary station |
| aux$distance | Distance metric for auxiliary station selection: "correlation" or "euclidean" |
| aux$max_dist_km | Maximum allowed distance (km) for an auxiliary station |
| gf$method | Gap-filling method: "diff", "ratio", "logratio", "direct", or "qmap" |
| gf$seasonal | Logical: use seasonally varying coefficients |
| hg$correction | Correction type: "daily", "monthly", "annual", or "none" |
| hg$method | Homogenisation method: "diff" or "ratio" |
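To make the default/override behaviour concrete, the sketch below merges a hypothetical default section with a hypothetical pr section and then uses the resolved gf$method to pick a gap-filling estimator. Everything in it is an assumption for illustration: the option values are invented, modifyList() stands in for whatever mechanism main.R uses to resolve config.yml, and the "diff" and "ratio" estimators (a single auxiliary series, constant offset or ratio of sums) are simplified guesses at what those method names mean, not the pipeline's actual formulas.

```r
# Hypothetical values: the real defaults live in R/config.yml.
default <- list(
  gf = list(method = "diff", seasonal = TRUE),
  hg = list(correction = "monthly", method = "diff")
)
pr_section <- list(
  gf = list(method = "ratio"),
  hg = list(method = "ratio")
)

# modifyList() merges nested lists recursively, so fields that the
# variable section does not mention keep their default value.
cfg <- modifyList(default, pr_section)

# Assumed estimators for two of the gf$method options, one auxiliary series.
fill_gaps <- function(cand, aux, method) {
  both <- !is.na(cand) & !is.na(aux)          # days observed at both stations
  est <- switch(
    method,
    diff = aux + mean(cand[both] - aux[both]),
    ratio = aux * sum(cand[both]) / sum(aux[both]),
    stop("method not covered by this sketch: ", method)
  )
  gaps <- is.na(cand) & !is.na(est)
  cand[gaps] <- est[gaps]                     # observed values are kept as-is
  cand
}

cand <- c(4, 0, NA, 10)   # candidate series with one missing day
aux <- c(2, 0, 3, 5)      # neighbouring (auxiliary) series
filled <- fill_gaps(cand, aux, cfg$gf$method)
```

Using a ratio of sums rather than a mean of daily ratios avoids dividing by zero on dry days, which is one reason a ratio-type method suits precipitation better than a difference.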

Supported variables

| Config key | Description | AEMET source variable(s) | Unit |
| --- | --- | --- | --- |
| tmax | Maximum daily temperature | TMAX | °C |
| tmin | Minimum daily temperature | TMIN | °C |
| trange | Daily thermal range (Tmax − Tmin) | TMAX, TMIN | °C |
| pr | Daily total precipitation | P | mm day⁻¹ |
| hr | Mean daily relative humidity (4 synoptic obs.) | HU00, HU07, HU13, HU18 | % |
| ssrd | Daily total global radiation | RGLODIA, TOTSOL | kJ m⁻² day⁻¹ |
| ws | Mean daily wind speed (4 synoptic obs.) | VEL_00, VEL_07, VEL_13, VEL_18 | km h⁻¹ |

Additional variables can be added by creating new sections in config.yml following the same structure.
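For the variables built from several AEMET source codes, the aggregation can be sketched in base R as below. The data frame and its values are made up, and the treatment of missing observations (a day is left missing if any source value is missing) is an assumption rather than the pipeline's documented behaviour.

```r
# Made-up daily records using the AEMET source codes listed above.
obs <- data.frame(
  TMAX = c(25.1, 30.4, NA),
  TMIN = c(12.0, 18.9, 10.2),
  HU00 = c(80, 75, 90),
  HU07 = c(85, 70, 95),
  HU13 = c(40, 35, NA),
  HU18 = c(55, 50, 70)
)

daily <- data.frame(
  # Daily thermal range: Tmax minus Tmin
  trange = obs$TMAX - obs$TMIN,
  # Mean of the four synoptic humidity observations (NA if any is missing)
  hr = rowMeans(obs[, c("HU00", "HU07", "HU13", "HU18")])
)
daily
```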

Output

Each pipeline stage writes results to its configured output directory:

| Stage | Output | Description |
| --- | --- | --- |
| pp | <pp_dir>/<var>.rds | Pre-processed R5 object |
| pp | <pp_dir>/<var>.html | Summary report |
| gf | <gf_dir>/<var>.rds | Gap-filled R5 object |
| gf | <gf_dir>/<var>.html | Summary report |
| gf | <gf_dir>/<var>/<stn>.html | Per-station reports |
| hg | <hg_dir>/<var>.rds | Homogenised R5 object |
| hg | <hg_dir>/<var>.html | Summary report |
| gr | <gr_dir>/<var>_pen.nc | Daily grids, mainland (NetCDF) |
| gr | <gr_dir>/<var>_can.nc | Daily grids, Canary Islands (NetCDF) |

The .rds objects are R5 reference-class instances containing all data, metadata, configuration, and embedded methods, so they are fully self-contained and reproducible.

Relationship to quality_control

This repository is the direct downstream successor of quality_control. The QC pipeline produces cleaned station data (as .RData files or database tables in the quality_control schema); data_flow reads this output as its starting point. Neither pipeline can substitute for the other:

  • quality_control must be run first to remove erroneous observations.
  • data_flow then fills gaps, removes inhomogeneities, and interpolates the cleaned series onto a regular grid.

In production, both pipelines write to and read from a shared PostgreSQL database. In local mode, the QC .RData files must be placed in the directory specified by config$dir$qc_dir.

Further Documentation

See docs/full_documentation.md for detailed descriptions of each pipeline stage, all function signatures, output data structures, and suggestions for further development.

License

GPL-3 or later. See http://www.gnu.org/licenses/gpl.txt.

Authors

Santiago Beguería, LCSC-CSIC (https://lcsc.csic.es).
