The public scoreboard for loop engineering.
Fixed tasks. Fixed seeds. Observed LES. Submissions anyone can audit.
No hand-waved demos — bring an LSS spec, get a number, climb the leaderboard.
pip install "le-loopforge>=0.5.0" "le-loopctl>=0.5.0" loopbench loopgym
loopbench list
loopbench suite list
Run your first score · Live leaderboard · Loop Playground · Leaderboard JSON · Suite overview
You submit a loop specification (LSS YAML). LoopBench:
- Runs it through LoopGym on fixed task instances
- Computes Success@k and LES_obs across eight categories
- Validates your `results.json` against a published schema
- Ranks you on the public leaderboard; generalist (grand composite) is the primary rank
loopbench run --suite suite-repair --spec your-loop.yaml --seeds 0,1,2,3,4 -o results.json
loopbench validate results.json
loopbench rank leaderboard/entries.json
loopbench rank leaderboard/entries.json --suite suite-repair
One LSS spec → four suite scores → one generalist rank on the public leaderboard. No hand-waved demos. No private benchmarks. Just numbers anyone can reproduce.
| What LoopBench gives you | Why it beats "we tried it internally" |
|---|---|
| Fixed tasks & seeds | Same inputs for every submission — apples to apples |
| LES_obs across 8 dimensions | Speed, cost, robustness, safety — not just pass/fail |
| Public leaderboard | Reputation you can link in a PR |
| Schema-valid results.json | No moving goalposts after the fact |
| Generalist rank | One number that rewards loops that work everywhere |
| Suite | Tasks | Stress area |
|---|---|---|
| `suite-repair` | 5 | verify-driven repair + safety |
| `suite-agent` | 5 | multi-agent coordination |
| `suite-knowledge` | 4 | research + RAG |
| `suite-rigor` | 5 | composition + HITL + memory |
flowchart LR
YOU["Your LSS spec"]
LB["LoopBench<br/>tasks · scoring · conformance"]
LG["LoopGym<br/>SimEnv execution"]
OUT["results.json → leaderboard"]
YOU --> LB
LB --> LG
LG --> LB
LB --> OUT
| Layer | Owns | Repo |
|---|---|---|
| Spec | LSS schema, LES formulas | Loop Core Engineering |
| Data | Trajectories (holdout v0.2) | LoopNet |
| Runtime | `env.run_episode()` | LoopGym |
| Observability | LTF traces, iteration metrics | loop-observability |
| Measurement | Tasks, LES_obs, anti-gaming | LoopBench |
LoopBench defines and scores. LoopGym runs. Never the other way around.
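The one-way dependency described above (measurement calls runtime, never the reverse) can be illustrated with a toy sketch. Only the method name `run_episode` comes from the table; its signature, the `Env` protocol, `ToySimEnv`, and `score_spec` are all hypothetical and are not LoopGym's or LoopBench's actual interfaces.

```python
import random
from typing import Protocol


class Env(Protocol):
    """Hypothetical runtime interface; only the method name is from the README."""

    def run_episode(self, spec: dict, seed: int) -> float: ...


class ToySimEnv:
    """Stand-in runtime: returns a deterministic pseudo-score per (spec, seed)."""

    def run_episode(self, spec: dict, seed: int) -> float:
        # Seeding with a string is deterministic, so the same inputs
        # always produce the same episode score.
        rng = random.Random(f"{spec.get('name', '')}:{seed}")
        return rng.random()


def score_spec(env: Env, spec: dict, seeds: list[int]) -> float:
    """Measurement layer: owns the seeds and the aggregation, delegates execution."""
    episodes = [env.run_episode(spec, seed) for seed in seeds]
    return sum(episodes) / len(episodes)
```

The runtime class knows nothing about scoring, so swapping in a different environment would not change how results are aggregated. Fixed seeds are what make two runs of the same spec comparable.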
New to the stack? Start with the LoopNet end-to-end tutorial.
19 micro-tasks feed 4 comparison suites. Primary leaderboard rank = generalist (mean of suite scores).
| Suite ID | Label | Micro-tasks |
|---|---|---|
| `suite-repair` | Repair & Verify | LB-CR-1, LB-REACT-1, LB-REFLEX-1, LB-OPT-1, LB-SAFE-1 |
| `suite-agent` | Multi-Agent | LB-MA-1, LB-CREW-1, LB-GRAPH-1, LB-TOT-1, LB-VOTE-1 |
| `suite-knowledge` | Research & RAG | LB-RS-1, LB-RAG-1, LB-BOOT-1, LB-AUTO-1 |
| `suite-rigor` | Composition & Safety | LB-COMP-1, LB-NEST-1, LB-SIM-1, LB-HITL-1, LB-MEM-1 |
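As a rough illustration of the generalist rule stated above (mean of the four suite scores), here is a minimal sketch. The suite and task IDs are copied from the table; the function name `generalist_score` and the example scores in the usage note are made up, and how each suite score is derived from its micro-tasks is not covered here.

```python
from statistics import mean

# Suite IDs and micro-task IDs as listed in the table above.
SUITES = {
    "suite-repair": ["LB-CR-1", "LB-REACT-1", "LB-REFLEX-1", "LB-OPT-1", "LB-SAFE-1"],
    "suite-agent": ["LB-MA-1", "LB-CREW-1", "LB-GRAPH-1", "LB-TOT-1", "LB-VOTE-1"],
    "suite-knowledge": ["LB-RS-1", "LB-RAG-1", "LB-BOOT-1", "LB-AUTO-1"],
    "suite-rigor": ["LB-COMP-1", "LB-NEST-1", "LB-SIM-1", "LB-HITL-1", "LB-MEM-1"],
}


def generalist_score(suite_scores: dict[str, float]) -> float:
    """Unweighted mean over all four suites; refuses partial submissions."""
    missing = set(SUITES) - set(suite_scores)
    if missing:
        raise ValueError(f"missing suite scores: {sorted(missing)}")
    return mean(suite_scores[suite] for suite in SUITES)
```

For example, hypothetical suite scores of 0.8, 0.9, 0.7, and 0.6 would give a generalist score of 0.75. Because the mean is unweighted, the four-task knowledge suite counts as much as the five-task suites.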
loopbench suite list
loopbench run --suite suite-repair --spec your-loop.yaml --seeds 0,1,2,3,4 -o results.json
loopbench run --task LB-CR-1 --spec your-loop.yaml --seeds 0,1,2,3,4 -o results.json
Full catalog in tasks/index.yaml and SUITE-OVERVIEW.md.
Live board (updated 2026-06-30) — full rankings
Generalist:
- Loop Engineering maintainer — LES 86.7
- Loop Engineering maintainer (MA-1) — LES 86.5
- Team Thorough — LES 86.4
After following REPRODUCE.md, post your 60-minute reproduction report on the reproduction challenge.
One command: BEAT_LB-CR-1.md — target LES_obs ≥ 86.7 on LB-CR-1.
Also: BEAT_LB-RS-1.md (81.9) · BEAT_LB-MA-1.md (86.5) · BEAT_LB-COMP-1.md (80.3)
pip install "le-loopforge>=0.5.0" "le-loopctl>=0.5.0" "loopbench>=0.2.0" "loopgym>=0.1.3"
# see BEAT_LB-CR-1.md for full clone + run + submit
pip install "le-loopforge>=0.5.0" "le-loopctl>=0.5.0" loopbench loopgym
loopbench suite list
loopbench run \
--suite suite-repair \
--spec submissions/examples/spec-fast-loop.yaml \
--seeds 0,1,2,3,4 \
-o results.json
loopbench validate results.json
loopbench rank results.json
Submit to the leaderboard: open a PR adding your entry to leaderboard/entries.json.
v0.2 accepts SimEnv submissions (fully reproducible, no API keys). LiveEnv tier is optional.
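If you want to script the run, validate, and rank steps above (for CI, say), one possible wrapper looks like this. It only assembles the exact commands and flags shown in this README and assumes `loopbench` is on your `PATH`; the helper names `build_commands` and `run_pipeline` are hypothetical and not part of LoopBench.

```python
import subprocess


def build_commands(spec: str, suite: str, seeds: list[int],
                   out: str = "results.json") -> list[list[str]]:
    """Assemble the three CLI invocations from the quickstart as argv lists."""
    seed_arg = ",".join(str(seed) for seed in seeds)
    return [
        ["loopbench", "run", "--suite", suite, "--spec", spec,
         "--seeds", seed_arg, "-o", out],
        ["loopbench", "validate", out],
        ["loopbench", "rank", out],
    ]


def run_pipeline(commands: list[list[str]]) -> None:
    """Run each command in order; check=True stops at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

Calling `run_pipeline(build_commands("your-loop.yaml", "suite-repair", [0, 1, 2, 3, 4]))` would then stop before ranking if validation fails.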
| Metric | Meaning |
|---|---|
| Success@k | Fraction of instances reaching goal threshold |
| LES_obs | Observed composite ∈ [0, 1] — eight categories |
| Grand composite | Mean of 4 suite scores — generalist rank |
| Cost | Estimated USD from LSS cost limits |
| Robustness | Quality retention across seeds |
Display scale 0–100 is optional (les × 100).
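To make two rows of the table concrete, here is a small sketch of a success fraction and the display-scale conversion. It assumes an instance "reaches the goal threshold" when its score is greater than or equal to the threshold, and it does not model the role of k in Success@k, which this README does not define. Both function names and the sample values in the usage note are hypothetical.

```python
def success_fraction(scores: list[float], threshold: float) -> float:
    """Fraction of instances whose score meets the goal threshold (>= assumed)."""
    if not scores:
        raise ValueError("need at least one instance score")
    return sum(score >= threshold for score in scores) / len(scores)


def to_display_scale(les: float) -> float:
    """Convert an LES value in [0, 1] to the optional 0-100 display scale."""
    if not 0.0 <= les <= 1.0:
        raise ValueError("LES must lie in [0, 1]")
    return round(les * 100, 1)
```

With sample scores `[0.9, 0.4, 0.7, 0.8]` and a threshold of `0.7`, `success_fraction` returns `0.75`; `to_display_scale(0.867)` returns `86.7`, the scale the leaderboard figures in this README appear to use.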
| You are… | LoopBench gives you… |
|---|---|
| Loop designer | A number you can improve release-over-release |
| Framework author | A neutral arena — not your own benchmark |
| Researcher | Reproducible tasks + published submission schema |
| Team lead | Comparable scores across designs and vendors |
@software{loopbench2026,
title={LoopBench: Benchmark Suite for Loop Engineering},
author={Malpani, Kanak},
year={2026},
url={https://pypi.org/project/loopbench/}
}
MIT · v0.2.0 · Contributing
