Gentle observability for long training runs.
Emry watches your training run the way you'd want a good colleague to: quietly,
without ever getting in the way. A training loop calls run.emit(); metrics
flow through a lock-free ring into an event-sourced engine that persists an
append-only log and serves a live dashboard. No accounts, no phone-home — just
your metrics, on your machine, in a file you can read.
The terminal dashboard (emry watch) — live loss curve with a dashed
amber baseline overlay for run comparison, phase bands, checkpoint markers,
metric cards, and alerts. Full parity with the web dashboard.
The self-hosted web dashboard (emry web) — live chart with a dashed
baseline overlay for run comparison, phase bands, and checkpoint markers. No
CDN; works air-gapped.
- Stays out of the way. emit() targets well under 10 µs amortized (tens of nanoseconds in our benchmarks) and never blocks the training thread. Every queue is bounded and drops-and-counts under load, so observability can never harm the run.
- Event-sourced. An append-only events.jsonl is the audit trail; a wide metrics.jsonl is plain JSONL you can read with jq, pandas, or anything.
- Observe live or after the fact. A terminal dashboard and a self-hosted web dashboard (no CDN, air-gap friendly) at full parity: live chart, phase bands, checkpoint markers, and a baseline overlay to compare against a prior run. Or just tail the files.
- Built for clusters. Embedded, sidecar, or file modes; auto-detects SSH/SLURM. The training process survives an engine crash.
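The "bounded, drops-and-counts" behaviour described in the list above can be illustrated in a few lines of Python. This is only a sketch of the idea, not Emry's implementation: it uses the standard library's lock-based queue.Queue rather than a lock-free ring, and DropCountingQueue, offer, and drain are hypothetical names chosen for the example.

```python
import queue


class DropCountingQueue:
    """Bounded queue whose producer side never blocks: when full, drop and count."""

    def __init__(self, capacity: int) -> None:
        self._q: queue.Queue = queue.Queue(maxsize=capacity)
        self.dropped = 0  # number of items rejected because the queue was full

    def offer(self, item) -> bool:
        """Try to enqueue without blocking; return False if the item was dropped."""
        try:
            self._q.put_nowait(item)
            return True
        except queue.Full:
            # Single-producer sketch: this increment is not atomic across threads.
            self.dropped += 1
            return False

    def drain(self) -> list:
        """Consumer side: remove and return everything currently queued."""
        items = []
        while True:
            try:
                items.append(self._q.get_nowait())
            except queue.Empty:
                return items


# A producer that outpaces the consumer: capacity 2, five events offered.
q = DropCountingQueue(capacity=2)
results = [q.offer({"step": i, "loss": 1.0 / (i + 1)}) for i in range(5)]
```

The producer pays at most one failed enqueue attempt per event, and the drop counter keeps the loss of telemetry visible rather than silent.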
```shell
pip install emry
```

Your training loop calls emry.run(...) and run.emit(...). That's it:

```python
import emry

# train_step() and scheduler come from your own training code.
with emry.run("llama-sft", config={"lr": 2e-5}, metrics=["loss", "lr"]) as run:
    for step in run.steps(10_000):
        loss = train_step()
        run.emit(loss=loss, lr=scheduler.get_last_lr()[0])
```

run.steps(n) yields steps and advances Emry's step counter for you; emit()
takes any metrics as keyword arguments. Mark phases with
run.phase = emry.Phase.EVAL, and iterate epochs with run.epochs(n) to track
the epoch automatically. Values are duck-typed — tensors and numpy scalars are
coerced, so you can pass loss directly without .item(). When an NVIDIA GPU is
present, Emry samples nvidia-smi automatically and charts GPU utilization,
memory, and temperature alongside your metrics (gpu=False to disable). Pass
alert_webhook= (or set EMRY_ALERT_WEBHOOK) to get a Slack/Discord ping the
moment a metric goes NaN/Inf.
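To make the duck-typed coercion and the NaN/Inf check concrete, here is a minimal sketch of how such logic could look. It is an assumption about the general technique, not Emry's code: to_float, non_finite, and FakeTensor are hypothetical names, and FakeTensor only stands in for a 0-d tensor so the example runs without torch or numpy.

```python
import math
from typing import Any


def to_float(value: Any) -> float:
    """Coerce a scalar-like value to a Python float (hypothetical helper)."""
    item = getattr(value, "item", None)
    if callable(item):
        # Tensors and numpy scalars expose .item() to unwrap a single value.
        value = item()
    return float(value)


def non_finite(metrics: dict[str, Any]) -> list[str]:
    """Return the names of metrics whose coerced value is NaN or +/-Inf."""
    return [name for name, v in metrics.items() if not math.isfinite(to_float(v))]


class FakeTensor:
    """Stand-in for a 0-d tensor; only implements .item()."""

    def __init__(self, v: float) -> None:
        self._v = v

    def item(self) -> float:
        return self._v


# One healthy metric, one NaN wrapped in a tensor-like object, one Inf.
bad = non_finite(
    {"loss": FakeTensor(float("nan")), "lr": 2e-5, "grad_norm": float("inf")}
)
```

A check like this is cheap enough to run on every emitted value, which is what lets an alert fire on the first bad step.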
By default Emry writes a run directory under ./logs/ and, when attached to a
TTY, brings up the live terminal dashboard. Set EMRY_MODE (embedded |
sidecar | file) to control how it runs, or observe any run after the fact
with the commands below.
```shell
emry runs                                   # list runs under ./logs
emry watch ./logs/llama-sft_…               # live terminal dashboard
emry web --run-dir ./logs/…                 # live web dashboard at http://127.0.0.1:8787
emry watch ./logs/new --compare ./logs/old  # overlay a prior run as a baseline (TUI or web)
emry compare run_a/ run_b/                  # final metrics side by side
emry export csv --run-dir ./logs/… --output history.csv
```

On a cluster, run the engine as a sidecar so observability outlives the training process; see the SLURM runbook.

- SLURM / sidecar runbook: login-node-observe + on-node sidecar engine.
- Migration guide: the metrics.jsonl schema and importing history from other loggers.
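Because metrics.jsonl is plain JSONL, a run can also be inspected with a few lines of standard-library Python. The sketch below writes a tiny stand-in file so that it is self-contained; the field names step, loss, and lr are assumptions for illustration, and the actual schema is the one described in the migration guide.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path: Path) -> list[dict]:
    """Parse one JSON object per non-empty line."""
    rows = []
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows


# Stand-in for a real run directory; field names here are assumed, not Emry's schema.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "metrics.jsonl"
    path.write_text(
        '{"step": 1, "loss": 2.5, "lr": 2e-05}\n'
        '{"step": 2, "loss": 2.1, "lr": 2e-05}\n',
        encoding="utf-8",
    )
    rows = read_jsonl(path)

final = rows[-1]  # last recorded row
min_loss = min(r["loss"] for r in rows if "loss" in r)
```

If pandas is available, pd.read_json(path, lines=True) loads the same file into a DataFrame in one call.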
- Rust 1.88+ (rust-toolchain.toml pins the toolchain)
- llvm-tools-preview for coverage: rustup component add llvm-tools-preview
- cargo-llvm-cov: cargo install cargo-llvm-cov
- Python 3.10+

```shell
# Full local CI (fmt, clippy, test, ≥90% coverage)
./scripts/pre-commit-rust.sh

# Coverage only
./scripts/check-coverage.sh

# Python tests
pip install -e ".[dev]"
pytest

# Build the native extension locally (maturin)
pip install maturin && maturin develop

# Run the demos
cargo run -p emry-tui --example tui_demo
cargo run -p emry-web --example web_demo   # http://127.0.0.1:8788
```

```shell
pip install pre-commit
pre-commit install
```

Hooks run: trailing whitespace, YAML/TOML checks, then ./scripts/pre-commit-rust.sh
(fmt + clippy + test + 90% line coverage gate).
| Check | Threshold |
|---|---|
| cargo clippy | -D warnings (pedantic) |
| Rust line coverage | ≥ 90% (workspace) |
| Python line coverage | ≥ 90% (pytest --cov-fail-under=90) |
Apache License 2.0 — see LICENSE.

