TrainCheck: Invariant Checking for AI Training

TrainCheck catches silent training bugs by learning what a healthy run does, then checking a new run against those learned invariants. It works by tracing PyTorch API calls and model state changes, so you can inspect training behavior before a loss curve or final metric tells you something went wrong.
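
To make the learn-then-check idea concrete, here is a toy sketch in plain Python. It is not TrainCheck's implementation or trace format: the per-step lists of API names, the single invariant form ("every step that calls A also calls B"), and the helpers `infer_invariants` and `check` are all assumptions made up for illustration.

```python
from itertools import permutations


def infer_invariants(reference_steps):
    """Return (a, b) pairs where every reference step that calls a also calls b."""
    apis = sorted({api for step in reference_steps for api in step})
    return [
        (a, b)
        for a, b in permutations(apis, 2)
        if all(b in step for step in reference_steps if a in step)
    ]


def check(target_steps, invariants):
    """Return (step_index, a, b) for every target step that calls a but not b."""
    return [
        (i, a, b)
        for i, step in enumerate(target_steps)
        for a, b in invariants
        if a in step and b not in step
    ]


# Hypothetical traces: one list of API names per training step.
reference = [
    ["zero_grad", "forward", "backward", "step"],
    ["zero_grad", "forward", "backward", "step"],
]
# The target run forgets zero_grad in its second step, a bug that
# would not necessarily show up in the loss curve right away.
target = [
    ["zero_grad", "forward", "backward", "step"],
    ["forward", "backward", "step"],
]

invariants = infer_invariants(reference)
violations = check(target, invariants)
for step_index, a, b in violations:
    print(f"step {step_index}: {a} was called without {b}")
```

The point of the sketch is the division of labor: the healthy run supplies the rules, so nobody has to write assertions by hand, and the target run is flagged at the step where it first deviates.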

Install

Install TrainCheck in the same Python environment that runs your training script:

pip3 install traincheck

For CUDA, conda, and source-install details, see the Installation Guide.

Use TrainCheck

TrainCheck has four main steps.

1. Collect a Reference Trace

Run traincheck-collect on a known-good training script. This should be a short run that covers the training behavior you want TrainCheck to learn.

traincheck-collect \
  --pyscript reference.py \
  --models-to-track model \
  --output-dir reference_trace

2. Infer Invariants

Turn the reference trace into invariants:

traincheck-infer -f reference_trace -o invariants.json

3. Collect a Target Trace

Run the target training script with the inferred invariants. Passing --invariants lets TrainCheck trace only the APIs and variables needed for those checks.

traincheck-collect \
  --pyscript target.py \
  --models-to-track model \
  --invariants invariants.json \
  --output-dir target_trace

For long target runs, trace fewer steps by setting --sampling-interval and --warm-up-steps:

traincheck-collect \
  --pyscript target.py \
  --models-to-track model \
  --invariants invariants.json \
  --sampling-interval 10 \
  --warm-up-steps 10 \
  --output-dir target_trace

4. Check the Target Run

For live checking, start traincheck-onlinecheck while the target run is writing traces:

traincheck-onlinecheck -f target_trace -i invariants.json

Offline checking is simpler: wait for trace collection to finish, then run:

traincheck-check -f target_trace -i invariants.json

Both checkers write a results directory with failure logs and a report.html summary.
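
The four steps of the offline path can also be chained from a small driver script. The sketch below is a convenience wrapper, not part of TrainCheck: the function names `offline_workflow` and `run_workflow` and the `dry_run` switch are hypothetical, while the commands and flags are exactly the ones shown in the steps above. By default it only prints the commands; pass `dry_run=False` to execute them, which assumes TrainCheck is installed and both scripts exist.

```python
import shlex
import subprocess


def offline_workflow(reference_script, target_script, model_var="model"):
    """Build the four offline-workflow commands, in the order they must run."""
    return [
        # 1. Collect a reference trace from the known-good script.
        ["traincheck-collect", "--pyscript", reference_script,
         "--models-to-track", model_var, "--output-dir", "reference_trace"],
        # 2. Infer invariants from the reference trace.
        ["traincheck-infer", "-f", "reference_trace", "-o", "invariants.json"],
        # 3. Collect a target trace, limited to what the invariants need.
        ["traincheck-collect", "--pyscript", target_script,
         "--models-to-track", model_var, "--invariants", "invariants.json",
         "--output-dir", "target_trace"],
        # 4. Check the finished target trace against the invariants.
        ["traincheck-check", "-f", "target_trace", "-i", "invariants.json"],
    ]


def run_workflow(commands, dry_run=True):
    """Print each command; run it too unless dry_run is set."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)


commands = offline_workflow("reference.py", "target.py")
run_workflow(commands)  # dry run: prints the commands without executing them
```

Stopping at the first failure matters here because each step consumes the previous step's output: inferring from a missing trace or checking against a missing invariants file would only produce a confusing secondary error.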

Learn More

Use TrainCheck: explains the full workflow and output files.
5-Minute Tutorial: walks through a real silent training issue.
Installation Guide: covers environment setup.
Technical Documentation: describes invariants, trace representation, and implementation details.

Status

TrainCheck is under active development. Please join our Discord server, file a GitHub issue, or email traincheck@umich.edu.

Contributing

We welcome contributions. See Contributing to TrainCheck for setup and contribution guidance.

License

TrainCheck is licensed under the Apache License 2.0.

Citation

If TrainCheck is relevant to your work, please cite our paper:

@inproceedings{TrainCheckOSDI2025,
  author = {Jiang, Yuxuan and Zhou, Ziming and Xu, Boyu and Liu, Beijie and Xu, Runhui and Huang, Peng},
  title = {Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks},
  booktitle = {Proceedings of the 19th USENIX Symposium on Operating Systems Design and Implementation},
  series = {OSDI '25},
  month = {July},
  year = {2025},
  address = {Boston, MA, USA},
  publisher = {USENIX Association},
}

Artifact Evaluation

OSDI AE members should use the TrainCheck AE Guide.
