PhyTS is a benchmark suite of precision scientific time series datasets for machine learning, spanning experiments in gravitational-wave detection, dark matter searches, neutrino mass determination, and stellar variability detection. Despite their diverse scientific goals, these domains share a common challenge: recovering weak, structured signals and estimating underlying physical parameters from noise-dominated measurements.
Unlike standard sequence modeling benchmarks such as audio and speech, these data exhibit non-Gaussian and nonstationary noise, long-range temporal correlations, detector-specific systematics, irregular sampling, and signals that are sparse, weak, or only partially modeled. As a result, they provide a challenging testbed for evaluating whether modern AI methods can support downstream scientific inference.
Existing time series benchmarks inadequately prepare models for scientific applications due to three critical limitations: (1) unrealistic noise assumptions — most benchmarks use Gaussian noise rather than the complex, frequency-dependent backgrounds found in real detectors; (2) missing physics constraints — standard metrics ignore that scientific applications require interpretable confidence estimates and respect for physical laws; and (3) simplified temporal structure — scientific signals often exhibit multi-scale dependencies and rare transient events that are poorly represented in current datasets.
PhyTS addresses these gaps across four dimensions:
- Realistic complexity: physics domains spanning 12 orders of magnitude in sampling rate with authentic detector noise, non-stationary backgrounds, and scientifically meaningful signal-to-noise ratios.
- Physics-informed evaluation: task formulations that reflect real experimental constraints, including parameter inference under uncertainty and noise reduction across orders of magnitude of signal frequency.
- Systematic baselines: comprehensive comparisons of supervised models and six zero-shot foundation models, revealing fundamental limitations of current architectures on scientific data.
- Grounding in scientific measurement: each predicted value propagates transparently into a downstream physics result, enabling evaluation by scientific impact.
Paper: PhyTS: A Benchmark for Scientific Time Series (NeurIPS 2026)
Data: PhyTS-team/PhyTS-bench on Hugging Face
| Experiment | Domain | Task | Sampling rate | Seq. length | SNR |
|---|---|---|---|---|---|
| LIGO | Gravitational waves | Chirp-mass regression | 256 Hz | 1,024 (4 s) | 5–50 |
| ABRACADABRA | Axion dark matter | Time-series denoising | 10 MHz | 100,000 (1 s) | 0.02–200 |
| TESS | Stellar variability | 8-class classification | 0.56–1.67 mHz | 1,300–3,670 | 0.3–140 |
| Project 8 | Neutrino mass | Energy regression | 403 MHz | 24,576 (61 μs) | 3–25 |
Detailed dataset descriptions and preprocessing steps are in the paper (Section 3).
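The "12 orders of magnitude in sampling rate" figure quoted earlier can be checked directly from the table. The sketch below is a small arithmetic check, not part of the repository; the dictionary keys and the helper name `orders_of_magnitude` are illustrative, and the rates are copied from the table (TESS is given as a range, so both ends are included).

```python
import math

# Sampling rates in Hz, copied from the dataset table above.
SAMPLING_RATES_HZ = {
    "LIGO": 256.0,
    "ABRACADABRA": 10e6,
    "TESS (slowest)": 0.56e-3,
    "TESS (fastest)": 1.67e-3,
    "Project 8": 403e6,
}


def orders_of_magnitude(rates):
    """Return log10 of the ratio between the fastest and slowest rate."""
    rates = list(rates)
    return math.log10(max(rates) / min(rates))


span = orders_of_magnitude(SAMPLING_RATES_HZ.values())
print(f"Sampling rates span about {span:.1f} orders of magnitude")
```

The ratio between the Project 8 rate (403 MHz) and the slowest TESS cadence (0.56 mHz) comes out at roughly 11.9 decades, which rounds to the 12 stated above.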
Three environments are needed depending on which models you run. All main environments are managed by uv.
# PyTorch models (S4D, CNN, RNN, MLP, Conv-AE) and data download — covers all four datasets
make env
source .venv/bin/activate
# LinOSS (JAX/Equinox) — same environment with JAX + CUDA 12 added
make env-jax
source .venv/bin/activate
# Foundation models (MOMENT, Chronos, TimesFM, Time-MoE, MOIRAI, Granite TTM)
# Uses a separate Python 3.10 environment due to conflicting dependencies
make env-fm
source benchmarks/foundation/.venv/bin/activate
CUDA 13 (driver ≥ 580): replace the env-jax step with uv sync --extra jax --extra cu13.
All datasets are on Hugging Face at PhyTS-team/PhyTS-bench.
For pipeline verification (NeurIPS reproducibility), sample files covering all four domains can be downloaded in one step:
python data/download.py --sample
To download a full dataset:
python data/download.py --domain tess # 194 MB
python data/download.py --domain ligo # ~157 GB
python data/download.py --domain project8 # ~44 GB
python data/download.py --domain tidmad # ~163 GB; then run:
python data/TIDMAD/preprocess_tidmad.py \
--data_dir data/TIDMAD/original --out_dir data/TIDMAD/preprocessed
See python data/download.py --help for options.
All training goes through main.py (LightningCLI). Pick any config from configs/:
python main.py fit --config <path/to/config.yaml>
One example per domain:
# LIGO — chirp-mass regression, S4D
python main.py fit --config configs/LIGO/train_ligo_s4d_gaussnll_regression.yaml
# ABRACADABRA — denoising, LinOSS (requires env-jax)
python main.py fit --config configs/TIDMAD/train_tidmad_linoss_denoising.yaml
# TESS — variability classification, S4D
python main.py fit --config configs/TESS/train_tess_s4d_classification.yaml
# Project 8 — energy regression, S4D
python main.py fit --config configs/Project8/train_project8_s4d_regression_energy_gaussiannll.yaml
Any config value can be overridden on the command line:
python main.py fit --config configs/LIGO/train_ligo_conv1d_gaussnll_regression.yaml \
--model.init_args.lr 1e-4 \
--data.init_args.batch_size 64
The full set of configs is in configs/.
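Several config names contain gaussnll or gaussiannll, which suggests that the regression tasks are trained with a Gaussian negative log-likelihood, where the model predicts both a mean and a variance. That reading is an assumption based on the file names alone; the loss actually used lives in src/functions/ and may differ in detail. As a minimal sketch of the idea, with the hypothetical helper `gaussian_nll`:

```python
import math


def gaussian_nll(y_true, mean, log_var):
    """Negative log-likelihood of y_true under N(mean, exp(log_var)).

    Predicting the log-variance keeps the variance positive without
    any explicit constraint on the network output.
    """
    var = math.exp(log_var)
    return 0.5 * (math.log(2.0 * math.pi) + log_var + (y_true - mean) ** 2 / var)


# Same target, three hypothetical predictions:
confident_right = gaussian_nll(1.0, mean=1.0, log_var=math.log(0.01))
confident_wrong = gaussian_nll(1.0, mean=2.0, log_var=math.log(0.01))
hedged_wrong = gaussian_nll(1.0, mean=2.0, log_var=math.log(1.0))

print(confident_right, hedged_wrong, confident_wrong)
```

The loss rewards a tight variance when the mean is right and penalizes it heavily when the mean is wrong, which is why a model trained this way yields an uncertainty estimate alongside each prediction.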
Each domain has a pipeline script that trains all models and runs evaluation end-to-end:
bash benchmarks/LIGO/run.sh
bash benchmarks/TIDMAD/run.sh
bash benchmarks/TESS/run.sh
bash benchmarks/Project8/run.sh
For foundation models (zero-shot evaluation):
source benchmarks/foundation/.venv/bin/activate
python benchmarks/foundation/run_benchmark.py \
--models moment chronos timesfm moirai granite_ttm \
--tasks forecasting denoising embedding \
--mode zero_shot
ABRACADABRA denoising score: after training, evaluate all variants:
PYTHONPATH=src python benchmarks/TIDMAD/evaluate_all.py
The numbers below are from the paper (Table 2). Running the benchmark scripts writes results to results/. Foundation models are evaluated zero-shot.
| Model | LIGO RMSE [M☉] | LIGO R² | TESS Bal. Acc. | TESS R² | TIDMAD score | P8 RMSE [eV] | P8 R² |
|---|---|---|---|---|---|---|---|
| Mean baseline | 0.271 | 0.000 | 0.125 | −0.017 | 1.00 | 28.83 | 0.000 |
| S4D | 0.254 | 0.125 | 0.887 | 0.665 | — | 15.68 | 0.704 |
| LinOSS | 0.259 | 0.081 | 0.843 | 0.612 | 1.30 | 20.88 | 0.476 |
| CNN | 0.280 | −0.068 | 0.851 | 0.617 | −0.11 | 20.11 | 0.514 |
| MOMENT | 0.284 | −0.096 | 0.828 | 0.263 | 0.46 | 25.22 | 0.236 |
| Chronos | 0.278 | −0.052 | 0.812 | 0.305 | −0.88 | 25.22 | 0.235 |
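When reading the table, it helps to recall that R² and RMSE are linked through the mean baseline: if the baseline predicts the mean of the evaluation targets, then R² = 1 − (RMSE_model / RMSE_baseline)². The sketch below applies that identity to the Project 8 column. It assumes the "Mean baseline" row was computed that way, which the table does not state explicitly; the helper name `r2_from_rmse` is illustrative.

```python
def r2_from_rmse(rmse_model, rmse_mean_baseline):
    """R^2 = 1 - MSE_model / MSE_baseline for a mean-predicting baseline."""
    return 1.0 - (rmse_model / rmse_mean_baseline) ** 2


# Project 8 RMSE values [eV] and reported R^2, copied from the table above.
P8_MEAN_BASELINE_RMSE = 28.83
p8_rows = {
    "S4D": (15.68, 0.704),
    "LinOSS": (20.88, 0.476),
    "CNN": (20.11, 0.514),
}

for name, (rmse, reported_r2) in p8_rows.items():
    derived = r2_from_rmse(rmse, P8_MEAN_BASELINE_RMSE)
    print(f"{name}: derived R^2 = {derived:.3f}, reported = {reported_r2:.3f}")

# A constant predictor on a K-class task has recall 1 on one class and 0 on
# the rest, so its balanced accuracy is 1/K.
chance_balanced_accuracy = 1 / 8
```

The derived values match the reported ones to within the rounding of the inputs. The same identity explains the negative R² entries: any model whose RMSE exceeds the baseline's (for example LIGO CNN, 0.280 versus 0.271) lands below zero. The 0.125 balanced accuracy of the baseline is the chance level for the 8-class TESS task.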
src/
models/ # S4D, LinOSS, CNN, RNN, MLP, Conv-AE, classical filter
tasks/ # LightningModules per domain (LIGO, TIDMAD, TESS, Project8)
dataloader/ # PyTorch DataModules per domain
functions/ # Loss functions, dropout, learning-rate schedules
configs/ # YAML training configs (one per model × domain × task)
benchmarks/
LIGO/ # run.sh + evaluation pipeline
TIDMAD/ # run.sh + evaluation pipeline
TESS/ # run.sh + evaluation pipeline
Project8/ # run.sh + evaluation pipeline
foundation/ # Foundation-model wrappers and benchmark runner
data/
LIGO/ # HDF5 strain files (gitignored; download from HuggingFace)
TIDMAD/ # Raw HDF5 + preprocess_tidmad.py
TESS/ # Parquet files (gitignored; bash benchmarks/TESS/setup_data.sh)
Project8/ # HDF5 files (gitignored; download from HuggingFace)
plots/ # Generated by benchmark scripts (gitignored)
results/ # Generated by benchmark scripts (gitignored)
main.py # LightningCLI entry point
pyproject.toml # Dependencies (uv)
Makefile # Environment setup targets
| Model | Class | Backend |
|---|---|---|
| S4D | models.s4d.S4Model | PyTorch |
| LinOSS | models.linoss.LinOSS | JAX / Equinox |
| 1D CNN (LIGO) | models.conv1d_regressor.ResNet1DRegressor | PyTorch |
| 1D CNN (TESS) | models.conv.ConvClassifier | PyTorch |
| Conv-AE (TIDMAD) | models.conv_ae.ConvAE | PyTorch |
| 1D CNN (Project 8) | models.conv_regressor.Conv1DRegressor | PyTorch |
Tasks are organized by domain; configs live in configs/<domain>/. Any model can be swapped into any compatible task by changing model.class_path in the YAML.
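Swapping a model works because LightningCLI configs describe each component as a class_path plus init_args, and the class is imported and constructed from that description at startup. The sketch below illustrates the mechanism in plain Python. It is a simplified stand-in, not LightningCLI's actual implementation; the `instantiate` helper is hypothetical, and a standard-library class is used in place of a real model so the example runs anywhere.

```python
import importlib


def instantiate(spec):
    """Build an object from a {'class_path': ..., 'init_args': {...}} mapping."""
    module_name, _, class_name = spec["class_path"].rpartition(".")
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls(**spec.get("init_args", {}))


# Stand-in spec: in a real config, class_path would be one of the classes
# listed in the table above (for example models.s4d.S4Model) and init_args
# would hold that model's hyperparameters.
model_spec = {
    "class_path": "fractions.Fraction",
    "init_args": {"numerator": 3, "denominator": 4},
}

obj = instantiate(model_spec)
print(type(obj).__name__, obj)
```

Changing only the class_path string selects a different class, which is why a model swap needs no code change as long as the new class accepts the given init_args and matches the task's input and output shapes.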
| Model | Reference |
|---|---|
| MOMENT | benchmarks/foundation/wrappers/moment_wrapper.py |
| Chronos | benchmarks/foundation/wrappers/chronos_wrapper.py |
| TimesFM | benchmarks/foundation/wrappers/timesfm_wrapper.py |
| Time-MoE | benchmarks/foundation/wrappers/timemoe_wrapper.py |
| MOIRAI | benchmarks/foundation/wrappers/moirai_wrapper.py |
| Granite TTM | benchmarks/foundation/wrappers/granite_ttm_wrapper.py |
Foundation models are evaluated zero-shot and with lightweight fine-tuning via benchmarks/foundation/run_benchmark.py. Each wrapper exposes a uniform interface (forecast, denoise, embed) so the same evaluators run across all models.
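The value of a uniform interface is that an evaluator can be written once against the three methods and reused for every wrapper. The sketch below shows the pattern with `typing.Protocol`. Only the method names forecast, denoise, and embed come from the text above; the signatures, the toy `MovingAverageWrapper`, and the `evaluate_denoising` helper are assumptions made for illustration and do not reflect the repository's wrappers.

```python
from typing import List, Protocol, Sequence


class FoundationModelWrapper(Protocol):
    """Hypothetical shape of the shared wrapper interface."""

    def forecast(self, series: Sequence[float], horizon: int) -> List[float]: ...

    def denoise(self, series: Sequence[float]) -> List[float]: ...

    def embed(self, series: Sequence[float]) -> List[float]: ...


class MovingAverageWrapper:
    """Toy stand-in for a real model, used only to exercise the interface."""

    def __init__(self, window: int = 3) -> None:
        self.window = window

    def forecast(self, series, horizon):
        # Repeat the mean of the last `window` samples.
        tail = series[-self.window:]
        return [sum(tail) / len(tail)] * horizon

    def denoise(self, series):
        # Trailing moving average; shorter windows at the start of the series.
        out = []
        for i in range(len(series)):
            chunk = series[max(0, i - self.window + 1): i + 1]
            out.append(sum(chunk) / len(chunk))
        return out

    def embed(self, series):
        # Three summary statistics as a tiny fixed-length embedding.
        return [sum(series) / len(series), min(series), max(series)]


def evaluate_denoising(model: FoundationModelWrapper, noisy, clean) -> float:
    """Mean squared error between the model's denoised output and the clean signal."""
    denoised = model.denoise(noisy)
    return sum((d - c) ** 2 for d, c in zip(denoised, clean)) / len(clean)


wrapper = MovingAverageWrapper(window=2)
print(wrapper.denoise([1.0, 3.0, 5.0]))
print(evaluate_denoising(wrapper, [2.0, 2.0, 2.0], [2.0, 2.0, 2.0]))
```

Because `evaluate_denoising` depends only on the method names, any object that provides them can be passed in without inheriting from a common base class.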
@inproceedings{phyts2026,
title = {PhyTS: A Benchmark for Scientific Time Series},
author = {...},
booktitle = {Advances in Neural Information Processing Systems},
year = {2026}
}
Apache 2.0. The S4D implementation is derived from state-spaces/s4 (Apache 2.0); see NOTICE for full attribution. LinOSS is adapted from tk-rusch/linoss.