TrainCheck catches silent training bugs by learning what a healthy run does, then checking a new run against those learned invariants. It works by tracing PyTorch API calls and model state changes, so you can inspect training behavior before a loss curve or final metric tells you something went wrong.
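To make "learning invariants from a healthy run" concrete, here is a toy sketch of the idea. It is not TrainCheck's trace format or inference algorithm: the record fields, the candidate predicates, and the helper names `infer_invariants` and `check_trace` are all hypothetical and exist only for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical, simplified trace: one record of boolean observations per step.
# This is NOT TrainCheck's real trace representation.
Record = Dict[str, bool]

# Candidate invariants: named predicates evaluated on a single step record.
CANDIDATES: Dict[str, Callable[[Record], bool]] = {
    "zero_grad_called_each_step": lambda r: r["zero_grad_called"],
    "weights_change_after_optimizer_step": lambda r: (
        not r["optimizer_step_called"] or r["weights_changed"]
    ),
    "weights_never_change": lambda r: not r["weights_changed"],
}


def infer_invariants(reference: List[Record]) -> List[str]:
    """Keep only the candidates that hold on every record of the reference trace."""
    return [
        name
        for name, predicate in CANDIDATES.items()
        if all(predicate(record) for record in reference)
    ]


def check_trace(target: List[Record], invariants: List[str]) -> List[Tuple[int, str]]:
    """Return (step_index, invariant_name) for each violation in the target trace."""
    return [
        (step, name)
        for step, record in enumerate(target)
        for name in invariants
        if not CANDIDATES[name](record)
    ]


healthy_step = {
    "zero_grad_called": True,
    "optimizer_step_called": True,
    "weights_changed": True,
}
reference = [dict(healthy_step) for _ in range(3)]

target = [
    dict(healthy_step),
    # The optimizer stepped but the weights did not move: a silent bug.
    {"zero_grad_called": True, "optimizer_step_called": True, "weights_changed": False},
]

invariants = infer_invariants(reference)
violations = check_trace(target, invariants)
print(invariants)
print(violations)
```

The loss value never appears in this sketch. The violation is flagged from the traced behavior alone, which is why this style of check can fire before a metric degrades.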
Install TrainCheck in the same Python environment that runs your training script:
pip3 install traincheck

For CUDA, conda, and source-install details, see the Installation Guide.
TrainCheck has four main steps.
Run traincheck-collect on a known-good training script. This should be a short run that covers the training behavior you want TrainCheck to learn.
traincheck-collect \
--pyscript reference.py \
--models-to-track model \
--output-dir reference_trace

Turn the reference trace into invariants:
traincheck-infer -f reference_trace -o invariants.json

Run the target training script with the inferred invariants. Passing --invariants lets TrainCheck trace only the APIs and variables needed for those checks.
traincheck-collect \
--pyscript target.py \
--models-to-track model \
--invariants invariants.json \
--output-dir target_trace

For long target runs, trace fewer steps:
traincheck-collect \
--pyscript target.py \
--models-to-track model \
--invariants invariants.json \
--sampling-interval 10 \
--warm-up-steps 10 \
--output-dir target_trace

For live checking, start traincheck-onlinecheck while the target run is writing traces:
traincheck-onlinecheck -f target_trace -i invariants.json

The easier offline path is to wait for trace collection to finish, then run:
traincheck-check -f target_trace -i invariants.json

Both checkers write a results directory with failure logs and a report.html summary.
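If you prefer to drive the offline workflow from one script, the four steps can be chained as in the sketch below. It assumes the TrainCheck command-line tools are on your PATH and uses only the flags shown in this README. The helper names `build_pipeline` and `run_pipeline` and the default file names are hypothetical, not part of TrainCheck.

```python
import shlex
import subprocess
from typing import List


def build_pipeline(
    reference_script: str = "reference.py",
    target_script: str = "target.py",
    model_var: str = "model",
    invariants: str = "invariants.json",
    reference_trace: str = "reference_trace",
    target_trace: str = "target_trace",
) -> List[List[str]]:
    """Return the argv lists for collect, infer, collect-with-invariants, check."""
    return [
        # 1. Trace the known-good reference run.
        ["traincheck-collect", "--pyscript", reference_script,
         "--models-to-track", model_var, "--output-dir", reference_trace],
        # 2. Infer invariants from the reference trace.
        ["traincheck-infer", "-f", reference_trace, "-o", invariants],
        # 3. Trace the target run, limited to what the invariants need.
        ["traincheck-collect", "--pyscript", target_script,
         "--models-to-track", model_var, "--invariants", invariants,
         "--output-dir", target_trace],
        # 4. Check the finished target trace offline.
        ["traincheck-check", "-f", target_trace, "-i", invariants],
    ]


def run_pipeline(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(argv, check=True)


commands = build_pipeline()
run_pipeline(commands, dry_run=True)
```

The dry run only prints the commands, so you can review them first. Call `run_pipeline(commands, dry_run=False)` to execute them in order.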
- Use TrainCheck explains the full workflow and output files.
- 5-Minute Tutorial walks through a real silent training issue.
- Installation Guide covers environment setup.
- Technical Documentation describes invariants, trace representation, and implementation details.
TrainCheck is under active development. Please join our Discord server, file a GitHub issue, or email traincheck@umich.edu.
We welcome contributions. See Contributing to TrainCheck for setup and contribution guidance.
TrainCheck is licensed under the Apache License 2.0.
If TrainCheck is relevant to your work, please cite our paper:
@inproceedings{TrainCheckOSDI2025,
author = {Jiang, Yuxuan and Zhou, Ziming and Xu, Boyu and Liu, Beijie and Xu, Runhui and Huang, Peng},
title = {Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks},
booktitle = {Proceedings of the 19th USENIX Symposium on Operating Systems Design and Implementation},
series = {OSDI '25},
month = {July},
year = {2025},
address = {Boston, MA, USA},
publisher = {USENIX Association},
}

OSDI AE members should use the TrainCheck AE Guide.