TrainCheck: Invariant Checking for AI Training

TrainCheck catches silent training bugs by learning what a healthy run does, then checking a new run against those learned invariants. It works by tracing PyTorch API calls and model state changes, so you can inspect training behavior before a loss curve or final metric tells you something went wrong.

Install

Install TrainCheck in the same Python environment that runs your training script:

pip3 install traincheck
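After installing, a quick sanity check is to confirm that the commands used in the rest of this README are reachable. The sketch below assumes that the pip install places them on your PATH; the helper name traincheck_missing_tools is made up for illustration and is not part of TrainCheck.

```shell
# Print the name of every given command that is NOT found on PATH.
# Hypothetical helper, not shipped with TrainCheck.
traincheck_missing_tools() {
  for tool in "$@"; do
    # command -v prints nothing and returns non-zero when the tool is absent
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

Calling `traincheck_missing_tools traincheck-collect traincheck-infer traincheck-check traincheck-onlinecheck` prints nothing when all four commands are available, and one line per missing command otherwise. If something is missing, check that you installed into the environment that is currently active.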

For CUDA, conda, and source-install details, see the Installation Guide.

Use TrainCheck

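Using TrainCheck takes four steps: collect a reference trace, infer invariants from it, collect a target trace, and check the target run against the invariants.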

1. Collect a Reference Trace

Run traincheck-collect on a known-good training script. This should be a short run that covers the training behavior you want TrainCheck to learn.

traincheck-collect \
  --pyscript reference.py \
  --models-to-track model \
  --output-dir reference_trace
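To make the inputs concrete, the snippet below writes a tiny stand-in training script. It is purely illustrative: the file name toy_reference.py, the model, and the random data are all invented here, and it rests on the assumption that --models-to-track model refers to a variable named model in the script. Running the generated script requires PyTorch.

```shell
# Write a small, hypothetical training script (not part of TrainCheck).
cat > toy_reference.py <<'EOF'
import torch
from torch import nn

torch.manual_seed(0)

# Assumption: "--models-to-track model" refers to this variable name.
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

# A few steps on random data, covering forward, backward, and optimizer step.
for step in range(20):
    inputs = torch.randn(8, 4)
    targets = torch.randn(8, 1)
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()
EOF
```

You could then pass `--pyscript toy_reference.py` in place of reference.py to try the collection command. A real reference run should come from the training code you actually want to protect, not from a toy like this.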

2. Infer Invariants

Turn the reference trace into invariants:

traincheck-infer -f reference_trace -o invariants.json

3. Collect a Target Trace

Run the target training script with the inferred invariants. Passing --invariants lets TrainCheck trace only the APIs and variables needed for those checks.

traincheck-collect \
  --pyscript target.py \
  --models-to-track model \
  --invariants invariants.json \
  --output-dir target_trace

For long target runs, trace fewer steps by setting --sampling-interval and --warm-up-steps:

traincheck-collect \
  --pyscript target.py \
  --models-to-track model \
  --invariants invariants.json \
  --sampling-interval 10 \
  --warm-up-steps 10 \
  --output-dir target_trace

4. Check the Target Run

For live checking, start traincheck-onlinecheck while the target run is writing traces:

traincheck-onlinecheck -f target_trace -i invariants.json

Alternatively, for the simpler offline path, wait for trace collection to finish, then run:

traincheck-check -f target_trace -i invariants.json
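The four steps with the offline checker can be chained into one script. This is a minimal sketch rather than an official workflow: the run and traincheck_offline_pipeline helpers and the DRY_RUN variable are invented for illustration, the file and directory names are the ones used in the commands above, and it assumes each command exits with a non-zero status when it fails.

```shell
# Run one command, or only print it when DRY_RUN=1 (hypothetical helper).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Chain the four steps; each one starts only if the previous one succeeded.
traincheck_offline_pipeline() {
  run traincheck-collect --pyscript reference.py --models-to-track model \
      --output-dir reference_trace &&
  run traincheck-infer -f reference_trace -o invariants.json &&
  run traincheck-collect --pyscript target.py --models-to-track model \
      --invariants invariants.json --output-dir target_trace &&
  run traincheck-check -f target_trace -i invariants.json
}
```

Running `DRY_RUN=1 traincheck_offline_pipeline` prints the four commands without executing them, which is a cheap way to review paths before starting a long run. Without DRY_RUN, the function executes them in order.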

Both checkers write a results directory with failure logs and a report.html summary.
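This README does not say where the results directory is created or what it is called, so one way to locate the summary is simply to search for it. The helper below is a sketch with an invented name; it assumes the report is written somewhere under the directory you search.

```shell
# Print the path of every report.html under a directory.
# Defaults to the current directory. Hypothetical helper, not part of TrainCheck.
find_traincheck_reports() {
  find "${1:-.}" -type f -name report.html
}
```

Call `find_traincheck_reports` from the directory where you ran the checker, or pass a directory as the first argument, then open the printed path in a browser.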

Learn More

Status

TrainCheck is under active development. For questions or bug reports, join our Discord server, file a GitHub issue, or email traincheck@umich.edu.

Contributing

We welcome contributions. See Contributing to TrainCheck for setup and contribution guidance.

License

TrainCheck is licensed under the Apache License 2.0.

Citation

If TrainCheck is relevant to your work, please cite our paper:

@inproceedings{TrainCheckOSDI2025,
  author = {Jiang, Yuxuan and Zhou, Ziming and Xu, Boyu and Liu, Beijie and Xu, Runhui and Huang, Peng},
  title = {Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks},
  booktitle = {Proceedings of the 19th USENIX Symposium on Operating Systems Design and Implementation},
  series = {OSDI '25},
  month = {July},
  year = {2025},
  address = {Boston, MA, USA},
  publisher = {USENIX Association},
}

Artifact Evaluation

OSDI AE members should use the TrainCheck AE Guide.
