A complete automated pipeline for processing quality-controlled meteorological station data into gap-filled, homogenised time series and daily gridded datasets. It is the second stage of the PTI+ Clima climate data workflow, and must be run after the quality_control pipeline has been completed.
The pipeline takes as input the quality-controlled (QC) station data produced by the quality_control repository and applies a chain of processing steps:
- Pre-processing (`pp`): Crops to the analysis period, aggregates multi-variable measurements, classifies stations as candidates or auxiliaries, and computes inter-station distance and correlation matrices.
- Gap-filling (`gf`): Estimates missing values in candidate station series using information from neighbouring stations and configurable regression methods (difference, ratio, log-ratio, quantile-mapping).
- Gap-filling validation (`gf_valid`): Leave-one-out cross-validation of the gap-filling procedure, producing accuracy statistics and per-station diagnostic reports.
- Homogenisation (`hg`): Detects and corrects systematic shifts (inhomogeneities) in the station series using the Standard Normal Homogeneity Test (SNHT; Alexandersson 1986), applied recursively in daily and monthly modes.
- Climatologies (`cl`): Computes per-station daily climatological statistics (mean, standard deviation, 5th/95th percentiles, extremes) using a 31-day centred moving window.
- Gridding (`gr`): Creates daily regular grids by universal kriging with elevation (`alt`) and distance-to-sea (`dis`) as covariates, covering mainland Spain / Balearics and the Canary Islands separately.
- Gridding validation (`gr_valid`): Leave-one-out cross-validation of the spatial interpolation.
- Grid comparison (`gr_comp`): Compares the produced grids against reference gridded products (available for `tmax` and `tmin` only).
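
To make the homogenisation step more concrete, the following is a minimal sketch of the single-shift SNHT statistic, T(k) = k·z̄₁² + (n − k)·z̄₂², computed on a standardised series. The function name `snht_statistic` is hypothetical and is not taken from functions.R; the comparison against a critical value and the recursive application in daily and monthly modes described above are left out.

```r
# Hypothetical helper (not from functions.R): single-shift SNHT statistic.
# q is a numeric series, e.g. candidate minus reference values.
snht_statistic <- function(q) {
  q <- q[!is.na(q)]
  n <- length(q)
  stopifnot(n >= 3)
  z <- (q - mean(q)) / sd(q)            # standardise the series
  k <- seq_len(n - 1)                   # candidate break positions
  cs <- cumsum(z)
  z1 <- cs[k] / k                       # mean of z up to and including k
  z2 <- (cs[n] - cs[k]) / (n - k)       # mean of z after k
  t_k <- k * z1^2 + (n - k) * z2^2      # test statistic at each position
  list(t_max = max(t_k), break_at = which.max(t_k), t_k = t_k)
}

# Synthetic series with a shift after position 60 plus a small deterministic wiggle
q <- c(rep(0, 60), rep(1, 40)) + 0.1 * sin(seq_len(100))
res <- snht_statistic(q)
res$break_at   # position with the largest statistic
res$t_max      # value to compare against a critical level (not shown here)
```

A large `t_max` suggests a shift in the mean at `break_at`; a recursive scheme would then split the series there and test each segment again.
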
Outputs are daily NetCDF grids and per-stage .rds R objects containing all data, metadata, and configuration options.
data_flow/
├── R/
│ ├── main.R ← Entry point; orchestrates the full pipeline
│ ├── functions.R ← All functions and R5 class definitions
│ ├── config.yml ← Per-variable configuration options
│ ├── pp_report.Rmd ← Pre-processing summary report template
│ ├── gf_report.Rmd ← Gap-filling summary report template
│ ├── gf_validation.Rmd ← Gap-filling validation report template
│ ├── hg_report.Rmd ← Homogenisation summary report template
│ ├── gr_report.Rmd ← Gridding summary report template
│ ├── gr_validation.Rmd ← Gridding validation report template
│ └── gr_comparison.Rmd ← Grid comparison report template
├── data_raw/
│ ├── grd_pen.rda ← Support grid: mainland Spain + Balearics (UTM 30N)
│ └── grd_can.rda ← Support grid: Canary Islands (UTM 28N)
└── data/ ← Output directory (created at runtime)
    ├── pp/ ← Pre-processed objects
    ├── gf/ ← Gap-filled objects
    ├── hg/ ← Homogenised objects
    ├── cl/ ← Climatology objects
    └── gr/ ← Daily grids and NetCDF files
The pipeline is driven by a single script (R/main.R) that reads config.yml and sources all functions from functions.R. Each processing stage produces an .rds file (an R5 reference-class object) that is the input for the next stage.
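
As an illustration of how a stage result can travel between stages as an .rds file, here is a minimal sketch using a reference class. The class `StageResult`, its fields, and its method are hypothetical stand-ins; the actual class definitions live in R/functions.R and will differ.

```r
library(methods)

# Hypothetical class for illustration only; see R/functions.R for the real ones.
StageResult <- setRefClass(
  "StageResult",
  fields = list(
    variable = "character",
    stage    = "character",
    series   = "data.frame"
  ),
  methods = list(
    describe = function() {
      sprintf("%s after stage '%s': %d rows", variable, stage, nrow(series))
    }
  )
)

obj <- StageResult$new(
  variable = "tmax",
  stage    = "pp",
  series   = data.frame(
    date  = as.Date("1961-01-01") + 0:2,
    value = c(10.1, NA, 11.3)
  )
)

# Write the object as one stage would, then read it back as the next stage would
path <- file.path(tempdir(), "tmax.rds")
saveRDS(obj, path)

restored <- readRDS(path)
restored$stage <- "gf"        # the next stage updates its own copy
message(restored$describe())
```

The object read back with `readRDS()` is independent of `obj`, so modifying it does not alter the instance that was saved.
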
The pipeline can operate in two modes:
Database mode (default): data are read from and written to a PostgreSQL database (aemet on dana-sc-database). Quality-controlled input data are read from the quality_control schema; intermediate results are written to the data_flow schema. This is the intended production mode.
Local mode (flag -l / --local): data are read from and written to local .rds and .RData files. In local mode, the QC input files (.RData files produced by quality_control) must be present in the path specified by config$dir$qc_dir. This mode is useful for development, offline work, and reproducing results without database access.
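
Before a local-mode run it can help to confirm that the QC files are where the configuration expects them. The sketch below assumes a hypothetical helper, `find_qc_files`, which is not part of the pipeline; the directory passed to it stands in for `config$dir$qc_dir`.

```r
# Hypothetical helper (not part of data_flow): list the QC .RData files
# that a local-mode run would need, failing early if none are found.
find_qc_files <- function(qc_dir) {
  if (!dir.exists(qc_dir)) {
    stop("QC directory not found: ", qc_dir)
  }
  files <- list.files(qc_dir, pattern = "\\.RData$", full.names = TRUE)
  if (length(files) == 0) {
    stop("No .RData files in ", qc_dir, "; run quality_control first")
  }
  files
}

# Usage with a throwaway directory standing in for config$dir$qc_dir
qc_dir <- file.path(tempdir(), "qc_demo")
dir.create(qc_dir, showWarnings = FALSE)
x <- 1:3
save(x, file = file.path(qc_dir, "tmax.RData"))
basename(find_qc_files(qc_dir))
```
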
- R ≥ 4.2
- R packages (installed automatically via `pacman`): `tidyverse`, `argparser`, `config`, `chron`, `Rfast`, `snowfall`, `future`, `terra`, `sf`, `rnaturalearth`, `gstat`, `ncdf4`, `abind`, `stringi`, `RPostgres`, `dplyr`, `automap`, `FNN`
- Pandoc (for rendering R Markdown reports; ships with RStudio)
- The output of the `quality_control` pipeline, available either in the database or locally
Optionally, set the Pandoc path in the main.R preamble to match your installation:

    Sys.setenv(RSTUDIO_PANDOC = "/Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools")

Clone the repository:

    git clone https://github.com/PTI-Clima/data_flow.git
    cd data_flow

Open data_flow.Rproj in RStudio, or set the working directory manually:

    setwd("path/to/data_flow")

R package dependencies are installed automatically when main.R runs (via `pacman::p_load`).
Run from a terminal:

    Rscript R/main.R <var> [options]

Get the full help message:

    Rscript R/main.R --help

    usage: main.R [--] [--help] [--trial] [--local] [--no_global_rep]
                  [--no_indiv_rep] [--opts OPTS] [--procs PROCS]
                  [--verbosity VERBOSITY] [--ncores NCORES] var

    positional arguments:
      var                  variable to process: one of tmax, tmin, trange, pr, hr,
                           ssrd, ws (and others as added to config.yml)

    flags:
      -h, --help           show this help message and exit
      -t, --trial          enable trial mode (subset of data for testing)
      -l, --local          enable local mode (read/write local files, no database)
      -n, --no_global_rep  do not produce global HTML report
      --no_indiv_rep       do not produce per-station HTML reports

    optional arguments:
      -p, --procs          processes to run: one or more of pp, gf, gf_valid, hg,
                           cl, gr, gr_valid, gr_comp (default: all, in order)
      -v, --verbosity      verbosity level: 0 = silent, >0 = status messages (default: 1)
      -c, --ncores         number of CPU cores (0 = all available; default: 1)

Run the full pipeline for maximum temperature, using the database:

    Rscript R/main.R tmax

Run only gap-filling and homogenisation in local mode:

    Rscript R/main.R pr -l -p gf hg

Run a quick trial with only the first 100 stations and no reports:

    Rscript R/main.R tmax -t -l -n --no_indiv_rep

Run for all configured variables in sequence (shell script):

    #!/bin/bash
    for var in tmax tmin trange pr hr ssrd ws; do
      Rscript R/main.R $var -l
    done

All per-variable settings are in R/config.yml. There is a `default:` section with settings shared across variables; individual variable sections override these defaults. The main fields are:

| Field | Description |
|---|---|
| `dir$qc_dir` | Path to quality-control output (input to `pp`) |
| `dir$pp_dir` | Output directory for pre-processed objects |
| `dir$gf_dir` | Output directory for gap-filled objects |
| `dir$hg_dir` | Output directory for homogenised objects |
| `dir$cl_dir` | Output directory for climatology objects |
| `dir$gr_dir` | Output directory for gridded data |
| `period_analysis$start/end` | Analysis period (currently 1961-01-01 to 2025-10-31) |
| `var$name` | Internal variable identifier (e.g., "tmax") |
| `var$AEMET_name` | Raw AEMET variable code(s) in the QC data |
| `var$scale` | "relative" (e.g., temperature) or "absolute" (e.g., precipitation) |
| `var$unit` | Physical unit string |
| `var$factor` | Scale factor applied to raw AEMET values |
| `var$range` | Physical plausibility range (lower, upper) |
| `var$outlier_tolerance` | Percentage above/below climatological extremes to tolerate |
| `cand$min_years$active` | Minimum years required for an active station to be a candidate |
| `cand$min_years$inactive` | Minimum years required for an inactive station |
| `cand$exclude_auto` | Logical: exclude automatic stations from candidates |
| `aux$min_years` | Minimum years required for an auxiliary station |
| `aux$distance` | Distance metric for auxiliary station selection: "correlation" or "euclidean" |
| `aux$max_dist_km` | Maximum allowed distance (km) for an auxiliary station |
| `gf$method` | Gap-filling method: "diff", "ratio", "logratio", "direct", or "qmap" |
| `gf$seasonal` | Logical: use seasonally varying coefficients |
| `hg$correction` | Correction type: "daily", "monthly", "annual", or "none" |
| `hg$method` | Homogenisation method: "diff" or "ratio" |

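
To give a feel for what the "diff" and "ratio" options of `gf$method` mean, here is a deliberately simplified sketch that fills gaps in a candidate series from a single neighbour. The function `fill_from_neighbour` is hypothetical; the pipeline itself draws on several neighbouring stations and regression-based estimates, which this sketch does not attempt to reproduce.

```r
# Hypothetical single-neighbour illustration of "diff" and "ratio" estimates.
fill_from_neighbour <- function(cand, aux, method = c("diff", "ratio")) {
  method <- match.arg(method)
  both <- !is.na(cand) & !is.na(aux)   # days observed at both stations
  gaps <- is.na(cand) & !is.na(aux)    # days that can be estimated
  filled <- cand
  if (method == "diff") {
    # Additive relation: candidate = neighbour + mean difference
    offset <- mean(cand[both] - aux[both])
    filled[gaps] <- aux[gaps] + offset
  } else {
    # Multiplicative relation: candidate = neighbour * ratio of totals
    ratio <- sum(cand[both]) / sum(aux[both])
    filled[gaps] <- aux[gaps] * ratio
  }
  filled
}

cand <- c(12.0, NA, 14.0, 15.0, NA)
aux  <- c(10.0, 11.0, 12.0, 13.0, 14.5)
fill_from_neighbour(cand, aux, "diff")
fill_from_neighbour(cand, aux, "ratio")
```

An additive relation is the usual choice for variables such as temperature, while a multiplicative one suits variables bounded at zero such as precipitation.
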
| Config key | Description | AEMET source variable(s) | Unit |
|---|---|---|---|
| `tmax` | Maximum daily temperature | TMAX | °C |
| `tmin` | Minimum daily temperature | TMIN | °C |
| `trange` | Daily thermal range (Tmax − Tmin) | TMAX, TMIN | °C |
| `pr` | Daily total precipitation | P | mm day⁻¹ |
| `hr` | Mean daily relative humidity (4 synoptic obs.) | HU00, HU07, HU13, HU18 | % |
| `ssrd` | Daily total global radiation | RGLODIA, TOTSOL | kJ m⁻² day⁻¹ |
| `ws` | Mean daily wind speed (4 synoptic obs.) | VEL_00, VEL_07, VEL_13, VEL_18 | km h⁻¹ |

Additional variables can be added by creating new sections in config.yml following the same structure.
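
The way a variable section overrides the shared defaults can be emulated in base R with `modifyList()`, which merges nested lists recursively. The sketch below is only an illustration of that idea: the values and the variable name `myvar` are made up, and the pipeline itself reads config.yml rather than building lists by hand.

```r
# Settings mirroring a `default:` section (illustrative values only)
defaults <- list(
  gf = list(method = "diff", seasonal = TRUE),
  hg = list(correction = "monthly", method = "diff")
)

# Hypothetical new variable section that overrides part of the defaults
new_var <- list(
  var = list(name = "myvar", unit = "unit"),
  gf  = list(method = "ratio")
)

# Fields given in the variable section win; everything else is inherited
cfg <- modifyList(defaults, new_var)
str(cfg)
```

Only `gf$method` changes here; `gf$seasonal` and the whole `hg` block are inherited from the defaults.
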
Each pipeline stage writes results to its configured output directory:
| Stage | Output | Description |
|---|---|---|
| `pp` | `<pp_dir>/<var>.rds` | Pre-processed R5 object |
| `pp` | `<pp_dir>/<var>.html` | Summary report |
| `gf` | `<gf_dir>/<var>.rds` | Gap-filled R5 object |
| `gf` | `<gf_dir>/<var>.html` | Summary report |
| `gf` | `<gf_dir>/<var>/<stn>.html` | Per-station reports |
| `hg` | `<hg_dir>/<var>.rds` | Homogenised R5 object |
| `hg` | `<hg_dir>/<var>.html` | Summary report |
| `gr` | `<gr_dir>/<var>_pen.nc` | Daily grids, mainland (NetCDF) |
| `gr` | `<gr_dir>/<var>_can.nc` | Daily grids, Canary Islands (NetCDF) |

The .rds objects are R5 reference-class instances containing all data, metadata, configuration, and embedded methods, so they are fully self-contained and reproducible.
This repository is the direct downstream successor of quality_control. The QC pipeline produces cleaned station data (as .RData files or database tables in the quality_control schema); data_flow reads this output as its starting point. Neither pipeline can substitute for the other:
- `quality_control` must be run first to remove erroneous observations.
- `data_flow` then fills gaps, removes inhomogeneities, and interpolates the cleaned series onto a regular grid.

In production, both pipelines write to and read from a shared PostgreSQL database. In local mode, the QC .RData files must be placed in the directory specified by config$dir$qc_dir.
See docs/full_documentation.md for detailed descriptions of each pipeline stage, all function signatures, output data structures, and suggestions for further development.
GPL-3 or later. See http://www.gnu.org/licenses/gpl.txt.
Santiago Beguería, LCSC-CSIC (https://lcsc.csic.es).

