data_flow

A complete automated pipeline for processing quality-controlled meteorological station data into gap-filled, homogenised time series and daily gridded datasets. It is the second stage of the PTI+ Clima climate data workflow, and must be run after the quality_control pipeline has been completed.

Purpose

The pipeline takes as input the quality-controlled (QC) station data produced by the quality_control repository and applies a chain of processing steps:

Pre-processing (pp) — Crops to the analysis period, aggregates multi-variable measurements, classifies stations as candidates or auxiliaries, and computes inter-station distance and correlation matrices.
Gap-filling (gf) — Estimates missing values in candidate station series from neighbouring stations, using a configurable estimation method (difference, ratio, log-ratio, direct, or quantile mapping; see `gf$method` under Configuration).
Gap-filling validation (gf_valid) — Leave-one-out cross-validation of the gap-filling procedure, producing accuracy statistics and per-station diagnostic reports.
Homogenisation (hg) — Detects and corrects systematic shifts (inhomogeneities) in the station series using the Standard Normal Homogeneity Test (SNHT; Alexandersson 1986), applied recursively in daily and monthly modes.
Climatologies (cl) — Computes per-station daily climatological statistics (mean, standard deviation, 5th/95th percentiles, extremes) using a 31-day centred moving window.
Gridding (gr) — Creates daily regular grids by universal kriging with elevation (alt) and distance-to-sea (dis) as covariates, covering mainland Spain / Balearics and the Canary Islands separately.
Gridding validation (gr_valid) — Leave-one-out cross-validation of the spatial interpolation.
Grid comparison (gr_comp) — Compares the produced grids against reference gridded products (available for tmax and tmin only).
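
To make the homogenisation step more concrete, the single-shift SNHT statistic can be sketched in a few lines of base R. This is a minimal sketch and not the code in functions.R: the function name `snht_statistic`, the synthetic series, and the assumption of a complete (gap-free) difference series are all hypothetical, and the recursive daily/monthly application is not shown.

```r
# Single-shift SNHT statistic (Alexandersson 1986) for one series.
# Assumption: `x` is a gap-free candidate-minus-reference series.
snht_statistic <- function(x) {
  n <- length(x)
  z <- (x - mean(x)) / sd(x)        # standardise the series
  k <- seq_len(n - 1)               # candidate break positions
  csum <- cumsum(z)[k]              # running sum of z[1..k]
  mean_before <- csum / k
  mean_after <- (sum(z) - csum) / (n - k)
  t_k <- k * mean_before^2 + (n - k) * mean_after^2
  list(T0 = max(t_k), break_after = which.max(t_k), T = t_k)
}

# Synthetic example: a shift of 1.5 units introduced after position 60.
set.seed(1)
x <- c(rnorm(60, mean = 0), rnorm(40, mean = 1.5))
res <- snht_statistic(x)
res$break_after   # estimated position of the shift
res$T0            # to be compared against a critical value for n = 100
```

A series is flagged as inhomogeneous when `T0` exceeds the critical value for its length; a recursive scheme would then split the series at `break_after` and repeat the test on each segment.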

Outputs are daily NetCDF grids and per-stage .rds R objects containing all data, metadata, and configuration options.

Repository Structure

data_flow/
├── R/
│   ├── main.R               ← Entry point; orchestrates the full pipeline
│   ├── functions.R          ← All functions and R5 class definitions
│   ├── config.yml           ← Per-variable configuration options
│   ├── pp_report.Rmd        ← Pre-processing summary report template
│   ├── gf_report.Rmd        ← Gap-filling summary report template
│   ├── gf_validation.Rmd    ← Gap-filling validation report template
│   ├── hg_report.Rmd        ← Homogenisation summary report template
│   ├── gr_report.Rmd        ← Gridding summary report template
│   ├── gr_validation.Rmd    ← Gridding validation report template
│   └── gr_comparison.Rmd    ← Grid comparison report template
├── data_raw/
│   ├── grd_pen.rda          ← Support grid: mainland Spain + Balearics (UTM 30N)
│   └── grd_can.rda          ← Support grid: Canary Islands (UTM 28N)
└── data/                    ← Output directory (created at runtime)
    ├── pp/                  ← Pre-processed objects
    ├── gf/                  ← Gap-filled objects
    ├── hg/                  ← Homogenised objects
    ├── cl/                  ← Climatology objects
    └── gr/                  ← Daily grids and NetCDF files

The pipeline is driven by a single script (R/main.R) that reads config.yml and sources all functions from functions.R. Each processing stage produces an .rds file (an R5 reference-class object) that is the input for the next stage.
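
The hand-over between stages can be pictured as a reference-class object written with `saveRDS()` and read back with `readRDS()`. The sketch below assumes a hypothetical class `StageData` with made-up fields; the real class definitions live in R/functions.R and will differ.

```r
library(methods)

# Hypothetical stage container (not the class defined in functions.R).
StageData <- setRefClass(
  "StageData",
  fields = list(var = "character", stage = "character", values = "matrix"),
  methods = list(
    n_stations = function() ncol(values)   # one column per station
  )
)

# One stage writes its result...
pp <- StageData$new(
  var = "tmax", stage = "pp",
  values = matrix(c(21.4, 22.0, NA, 19.8), nrow = 2)
)
path <- tempfile(fileext = ".rds")   # stands in for <pp_dir>/<var>.rds
saveRDS(pp, path)

# ...and the next stage reads it back as its input.
gf_input <- readRDS(path)
gf_input$stage
gf_input$n_stations()
```

Because the object carries its own fields and methods, a later stage only needs the file path to pick up where the previous one stopped.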

Data modes

The pipeline can operate in two modes:

Database mode (default): data are read from and written to a PostgreSQL database (aemet on dana-sc-database). Quality-controlled input data are read from the quality_control schema; intermediate results are written to the data_flow schema. This is the intended production mode.

Local mode (flag -l / --local): data are read from and written to local .rds and .RData files. In local mode, the QC input files (.RData files produced by quality_control) must be present in the path specified by config$dir$qc_dir. This mode is useful for development, offline work, and reproducing results without database access.
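
When working in local mode it can help to inspect a QC file before running the pipeline. The sketch below is an assumption-laden illustration: the directory, the file name `tmax.RData`, and the object name `qc_tmax` are hypothetical stand-ins, since the actual contents of the quality_control output are not described here.

```r
# Create a stand-in QC file (names and values are made up).
qc_dir <- tempfile("qc_dir_")
dir.create(qc_dir)
qc_tmax <- data.frame(station = c("A001", "A002"), value = c(21.4, 19.8))
save(qc_tmax, file = file.path(qc_dir, "tmax.RData"))

# Load into a dedicated environment so that the objects in the file
# do not overwrite anything in the calling workspace.
qc_env <- new.env()
loaded <- load(file.path(qc_dir, "tmax.RData"), envir = qc_env)
loaded                                  # names of the restored objects
qc <- get(loaded[1], envir = qc_env)
nrow(qc)
```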

Prerequisites

R ≥ 4.2
R packages (installed automatically via pacman): tidyverse, argparser, config, chron, Rfast, snowfall, future, terra, sf, rnaturalearth, gstat, ncdf4, abind, stringi, RPostgres, dplyr, automap, FNN
Pandoc (for rendering R Markdown reports; ships with RStudio)
The output of the quality_control pipeline, available either in the database or locally

Optionally, set the Pandoc path in the main.R preamble to match your installation:

Sys.setenv(RSTUDIO_PANDOC = "/Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools")

Installation

git clone https://github.com/PTI-Clima/data_flow.git
cd data_flow

Open data_flow.Rproj in RStudio, or set the working directory manually:

setwd("path/to/data_flow")

R package dependencies are installed automatically when main.R runs (via pacman::p_load).

Usage

Run from a terminal:

Rscript R/main.R <var> [options]

Get the full help message:

Rscript R/main.R --help

usage: main.R [--] [--help] [--trial] [--local] [--no_global_rep]
              [--no_indiv_rep] [--opts OPTS] [--procs PROCS]
              [--verbosity VERBOSITY] [--ncores NCORES] var

positional arguments:
  var              variable to process: one of tmax, tmin, trange, pr, hr,
                   ssrd, ws (and others as added to config.yml)

flags:
  -h, --help       show this help message and exit
  -t, --trial      enable trial mode (subset of data for testing)
  -l, --local      enable local mode (read/write local files, no database)
  -n, --no_global_rep   do not produce global HTML report
      --no_indiv_rep    do not produce per-station HTML reports

optional arguments:
  -p, --procs      processes to run: one or more of pp, gf, gf_valid, hg,
                   cl, gr, gr_valid, gr_comp (default: all, in order)
  -v, --verbosity  verbosity level: 0 = silent, >0 = status messages (default: 1)
  -c, --ncores     number of CPU cores (0 = all available; default: 1)

Examples

Run the full pipeline for maximum temperature, using the database:

Rscript R/main.R tmax

Run only gap-filling and homogenisation in local mode:

Rscript R/main.R pr -l -p gf hg

Run a quick trial with only the first 100 stations and no reports:

Rscript R/main.R tmax -t -l -n --no_indiv_rep

Run for all configured variables in sequence (shell script):

#!/bin/bash
for var in tmax tmin trange pr hr ssrd ws; do
    Rscript R/main.R "$var" -l
done

Configuration

All per-variable settings are in R/config.yml. There is a default: section with settings shared across variables; individual variable sections override these defaults. The main fields are:

| Field | Description |
|---|---|
| `dir$qc_dir` | Path to quality-control output (input to `pp`) |
| `dir$pp_dir` | Output directory for pre-processed objects |
| `dir$gf_dir` | Output directory for gap-filled objects |
| `dir$hg_dir` | Output directory for homogenised objects |
| `dir$cl_dir` | Output directory for climatology objects |
| `dir$gr_dir` | Output directory for gridded data |
| `period_analysis$start/end` | Analysis period (currently 1961-01-01 to 2025-10-31) |
| `var$name` | Internal variable identifier (e.g., `"tmax"`) |
| `var$AEMET_name` | Raw AEMET variable code(s) in the QC data |
| `var$scale` | `"relative"` (e.g., temperature) or `"absolute"` (e.g., precipitation) |
| `var$unit` | Physical unit string |
| `var$factor` | Scale factor applied to raw AEMET values |
| `var$range` | Physical plausibility range (lower, upper) |
| `var$outlier_tolerance` | Percentage above/below climatological extremes to tolerate |
| `cand$min_years$active` | Minimum years required for an active station to be a candidate |
| `cand$min_years$inactive` | Minimum years required for an inactive station |
| `cand$exclude_auto` | Logical: exclude automatic stations from candidates |
| `aux$min_years` | Minimum years required for an auxiliary station |
| `aux$distance` | Distance metric for auxiliary station selection: `"correlation"` or `"euclidean"` |
| `aux$max_dist_km` | Maximum allowed distance (km) for an auxiliary station |
| `gf$method` | Gap-filling method: `"diff"`, `"ratio"`, `"logratio"`, `"direct"`, or `"qmap"` |
| `gf$seasonal` | Logical: use seasonally varying coefficients |
| `hg$correction` | Correction type: `"daily"`, `"monthly"`, `"annual"`, or `"none"` |
| `hg$method` | Homogenisation method: `"diff"` or `"ratio"` |
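
To give a feel for what the `"diff"` and `"ratio"` options of `gf$method` mean, here is a deliberately simplified sketch. It assumes a single auxiliary station and one constant coefficient estimated over the common period; the function `fill_gaps` and the data are hypothetical, and the pipeline's own implementation (several neighbours, optional seasonal coefficients) is more elaborate.

```r
# Fill gaps in a candidate series from one auxiliary series.
# Assumption: one neighbour, one constant offset or scaling factor.
fill_gaps <- function(cand, aux, method = c("diff", "ratio")) {
  method <- match.arg(method)
  both <- !is.na(cand) & !is.na(aux)      # days observed at both stations
  est <- switch(method,
    diff  = aux + mean(cand[both] - aux[both]),        # additive offset
    ratio = aux * mean(cand[both]) / mean(aux[both])   # multiplicative factor
  )
  out <- cand
  gaps <- is.na(cand) & !is.na(aux)       # only fill where the neighbour has data
  out[gaps] <- est[gaps]
  out
}

# Temperature-like example with an additive offset.
tmax_cand <- c(20.0, 21.0, NA, 23.0)
tmax_aux  <- c(18.0, 19.0, 20.0, 21.0)
fill_gaps(tmax_cand, tmax_aux, "diff")    # gap filled with 22

# Precipitation-like example with a multiplicative factor.
pr_cand <- c(2, NA, 6, 4)
pr_aux  <- c(1, 5, 3, 2)
fill_gaps(pr_cand, pr_aux, "ratio")       # gap filled with 10
```

An additive offset suits variables that can take any sign, whereas a ratio keeps estimates non-negative for bounded variables such as precipitation.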

Supported variables

| Config key | Description | AEMET source variable(s) | Unit |
|---|---|---|---|
| `tmax` | Maximum daily temperature | TMAX | °C |
| `tmin` | Minimum daily temperature | TMIN | °C |
| `trange` | Daily thermal range (Tmax − Tmin) | TMAX, TMIN | °C |
| `pr` | Daily total precipitation | P | mm day⁻¹ |
| `hr` | Mean daily relative humidity (4 synoptic obs.) | HU00, HU07, HU13, HU18 | % |
| `ssrd` | Daily total global radiation | RGLODIA, TOTSOL | kJ m⁻² day⁻¹ |
| `ws` | Mean daily wind speed (4 synoptic obs.) | VEL_00, VEL_07, VEL_13, VEL_18 | km h⁻¹ |
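
The derived variables in the table combine several AEMET source columns. The following sketch shows how `trange` and `hr` could be computed for one station; the data frame, its values, and in particular the treatment of missing observations are assumptions made for illustration, not the pipeline's documented behaviour.

```r
# Hypothetical daily records for one station; column names follow the
# AEMET codes in the table above, values are made up.
obs <- data.frame(
  TMAX = c(25.1, 27.3, NA),
  TMIN = c(12.0, 14.3, 11.5),
  HU00 = c(80, 75, 90),
  HU07 = c(85, 70, 92),
  HU13 = c(50, NA, 60),
  HU18 = c(65, 55, 70)
)

# Thermal range: missing whenever either extreme is missing.
trange <- obs$TMAX - obs$TMIN

# Mean relative humidity over the four synoptic observations.
# Assumption: a day with any missing observation is left missing.
hr <- rowMeans(obs[, c("HU00", "HU07", "HU13", "HU18")], na.rm = FALSE)

trange
hr
```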

Additional variables can be added by creating new sections in config.yml following the same structure.
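
The way a variable section overrides the shared defaults can be illustrated with nested R lists and `modifyList()`. The values below are invented for the example and do not reflect the shipped config.yml; the `config` package listed under Prerequisites is expected to resolve inheritance from `default` in a comparable way, but that is an assumption here.

```r
# Hypothetical excerpt of config.yml, written as nested R lists.
default <- list(
  gf = list(method = "diff", seasonal = TRUE),
  hg = list(correction = "monthly", method = "diff")
)
pr_section <- list(
  gf = list(method = "ratio"),
  hg = list(method = "ratio")
)

# modifyList() merges recursively: fields set in the variable section win,
# everything else falls back to the default.
pr_config <- modifyList(default, pr_section)
pr_config$gf$method       # "ratio"   (overridden)
pr_config$gf$seasonal     # TRUE      (inherited)
pr_config$hg$correction   # "monthly" (inherited)
```

A new variable section therefore only needs to list the fields that differ from `default`.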

Output

Each pipeline stage writes results to its configured output directory:

| Stage | Output | Description |
|---|---|---|
| `pp` | `<pp_dir>/<var>.rds` | Pre-processed R5 object |
| `pp` | `<pp_dir>/<var>.html` | Summary report |
| `gf` | `<gf_dir>/<var>.rds` | Gap-filled R5 object |
| `gf` | `<gf_dir>/<var>.html` | Summary report |
| `gf` | `<gf_dir>/<var>/<stn>.html` | Per-station reports |
| `hg` | `<hg_dir>/<var>.rds` | Homogenised R5 object |
| `hg` | `<hg_dir>/<var>.html` | Summary report |
| `gr` | `<gr_dir>/<var>_pen.nc` | Daily grids, mainland (NetCDF) |
| `gr` | `<gr_dir>/<var>_can.nc` | Daily grids, Canary Islands (NetCDF) |

The .rds objects are R5 reference-class instances containing all data, metadata, configuration, and embedded methods, so they are fully self-contained and reproducible.

Relationship to `quality_control`

This repository is the direct downstream successor of quality_control. The QC pipeline produces cleaned station data (as .RData files or database tables in the quality_control schema); data_flow reads this output as its starting point. Neither pipeline can substitute for the other:

quality_control must be run first to remove erroneous observations.
data_flow then fills gaps, removes inhomogeneities, and interpolates the cleaned series onto a regular grid.

In production, both pipelines write to and read from a shared PostgreSQL database. In local mode, the QC .RData files must be placed in the directory specified by config$dir$qc_dir.

Further Documentation

See docs/full_documentation.md for detailed descriptions of each pipeline stage, all function signatures, output data structures, and suggestions for further development.

License

GPL-3 or later. See http://www.gnu.org/licenses/gpl.txt.

Authors

Santiago Beguería, LCSC-CSIC (https://lcsc.csic.es).
