LoopBench

The public scoreboard for loop engineering.

Fixed tasks. Fixed seeds. Observed LES. Submissions anyone can audit.

No hand-waved demos — bring an LSS spec, get a number, climb the leaderboard.

pip install "le-loopforge>=0.5.0" "le-loopctl>=0.5.0" loopbench loopgym
loopbench list
loopbench suite list

Run your first score · Live leaderboard · Loop Playground · Leaderboard JSON · Suite overview

🚀 What LoopBench measures

You submit a loop specification (LSS YAML). LoopBench:

1. Runs it through LoopGym on fixed task instances
2. Computes Success@k and LES_obs across eight categories
3. Validates your results.json against a published schema
4. Ranks you on the public leaderboard, where generalist (grand composite) is the primary rank

loopbench run --suite suite-repair --spec your-loop.yaml --seeds 0,1,2,3,4 -o results.json
loopbench validate results.json
loopbench rank leaderboard/entries.json
loopbench rank leaderboard/entries.json --suite suite-repair

Suite coverage

One LSS spec → four suite scores → one generalist rank on the public leaderboard. No private benchmarks, just numbers anyone can reproduce.

What LoopBench gives you	Why it beats "we tried it internally"
Fixed tasks & seeds	Same inputs for every submission — apples to apples
LES_obs across 8 dimensions	Speed, cost, robustness, safety — not just pass/fail
Public leaderboard	Reputation you can link in a PR
Schema-valid results.json	No moving goalposts after the fact
Generalist rank	One number that rewards loops that work everywhere

Suite	Tasks	Stress area
`suite-repair`	5	verify-driven repair + safety
`suite-agent`	5	multi-agent coordination
`suite-knowledge`	4	research + RAG
`suite-rigor`	5	composition + HITL + memory
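
To score one spec against all four suites listed above, the `loopbench run` command can be repeated once per suite. The following is a minimal sketch that only builds the command lines, using the flags shown in this README (`--suite`, `--spec`, `--seeds`, `-o`). The helper name `build_run_commands` and the `results-<suite>.json` output naming are assumptions for illustration, not part of LoopBench.

```python
import shlex

# Suite IDs as listed in the suite coverage table.
SUITES = ["suite-repair", "suite-agent", "suite-knowledge", "suite-rigor"]


def build_run_commands(spec_path, seeds=(0, 1, 2, 3, 4)):
    """Return one `loopbench run` argv list per suite.

    The output file naming (results-<suite>.json) is a hypothetical
    convention for this sketch, not something LoopBench prescribes.
    """
    seed_arg = ",".join(str(s) for s in seeds)
    commands = []
    for suite in SUITES:
        commands.append([
            "loopbench", "run",
            "--suite", suite,
            "--spec", spec_path,
            "--seeds", seed_arg,
            "-o", f"results-{suite}.json",
        ])
    return commands


if __name__ == "__main__":
    for argv in build_run_commands("your-loop.yaml"):
        # Print the shell-quoted command; pass argv to subprocess.run
        # to execute it once loopbench is installed.
        print(shlex.join(argv))
```

Building argv lists rather than shell strings keeps the spec path safe to pass to `subprocess.run` without extra quoting.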

⚡ The measurement stack

flowchart LR
  YOU["Your LSS spec"]
  LB["LoopBench<br/>tasks · scoring · conformance"]
  LG["LoopGym<br/>SimEnv execution"]
  OUT["results.json → leaderboard"]

  YOU --> LB
  LB --> LG
  LG --> LB
  LB --> OUT

Layer	Owns	Repo
Spec	LSS schema, LES formulas	Loop Core Engineering
Data	Trajectories (holdout v0.2)	LoopNet
Runtime	`env.run_episode()`	LoopGym
Observability	LTF traces, iteration metrics	loop-observability
Measurement	Tasks, LES_obs, anti-gaming	LoopBench

LoopBench defines and scores. LoopGym runs. Never the other way around.

New to the stack? Start with the LoopNet end-to-end tutorial.

🧩 Suites and tasks (v0.2)

19 micro-tasks feed 4 comparison suites. Primary leaderboard rank = generalist (mean of suite scores).

Suite ID	Label	Micro-tasks
`suite-repair`	Repair & Verify	LB-CR-1, LB-REACT-1, LB-REFLEX-1, LB-OPT-1, LB-SAFE-1
`suite-agent`	Multi-Agent	LB-MA-1, LB-CREW-1, LB-GRAPH-1, LB-TOT-1, LB-VOTE-1
`suite-knowledge`	Research & RAG	LB-RS-1, LB-RAG-1, LB-BOOT-1, LB-AUTO-1
`suite-rigor`	Composition & Safety	LB-COMP-1, LB-NEST-1, LB-SIM-1, LB-HITL-1, LB-MEM-1

loopbench suite list
loopbench run --suite suite-repair --spec your-loop.yaml --seeds 0,1,2,3,4 -o results.json
loopbench run --task LB-CR-1 --spec your-loop.yaml --seeds 0,1,2,3,4 -o results.json

Full catalog in tasks/index.yaml and SUITE-OVERVIEW.md.
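
The generalist rank is described above as the mean of the four suite scores. A minimal sketch of that arithmetic follows. The function names, the `(name, suite_scores)` entry shape, and the choice to raise on a missing suite are assumptions for illustration; the actual layout of leaderboard/entries.json and the handling of partial submissions are not specified here.

```python
from statistics import mean

SUITES = ("suite-repair", "suite-agent", "suite-knowledge", "suite-rigor")


def generalist_score(suite_scores):
    """Mean of the four suite scores.

    Raising on a missing suite is this sketch's choice, not a
    documented LoopBench rule.
    """
    missing = [s for s in SUITES if s not in suite_scores]
    if missing:
        raise ValueError(f"missing suite scores: {missing}")
    return mean(suite_scores[s] for s in SUITES)


def rank_generalist(entries):
    """Sort (name, suite_scores) pairs by generalist score, best first."""
    scored = [(name, generalist_score(scores)) for name, scores in entries]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Made-up scores for two hypothetical submissions.
    demo = [
        ("loop-a", dict(zip(SUITES, (0.5, 0.5, 0.5, 0.5)))),
        ("loop-b", dict(zip(SUITES, (0.5, 0.75, 1.0, 0.75)))),
    ]
    for name, score in rank_generalist(demo):
        print(name, score)
```

Because the composite is an unweighted mean, a loop that is strong on one suite and weak on the others ranks below one that is moderately good on all four.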

📈 Live leaderboard

Live board (updated 2026-06-30) — full rankings

Generalist:

Loop Engineering maintainer — LES 86.7
Loop Engineering maintainer (MA-1) — LES 86.5
Team Thorough — LES 86.4

Submit your loop →

📈 Validate and reproduce

After working through REPRODUCE.md, post your 60-minute reproduction report on the reproduction challenge.

Beat maintainer LES (good-first #4)

One command: BEAT_LB-CR-1.md. Target: LES_obs ≥ 86.7 on LB-CR-1 (0–100 display scale).

Also: BEAT_LB-RS-1.md (81.9) · BEAT_LB-MA-1.md (86.5) · BEAT_LB-COMP-1.md (80.3)

pip install "le-loopforge>=0.5.0" "le-loopctl>=0.5.0" "loopbench>=0.2.0" "loopgym>=0.1.3"
# see BEAT_LB-CR-1.md for full clone + run + submit
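
A quick way to check a run against the BEAT targets above is to convert the raw score and compare. This sketch assumes the targets are quoted on the 0–100 display scale while LES_obs itself lies in [0, 1]. The `meets_target` helper and the rounding to one decimal are assumptions for illustration, not LoopBench behaviour.

```python
# Targets quoted in the BEAT_*.md list, read here as 0-100 display values.
TARGETS = {
    "LB-CR-1": 86.7,
    "LB-RS-1": 81.9,
    "LB-MA-1": 86.5,
    "LB-COMP-1": 80.3,
}


def meets_target(task_id, les_obs):
    """True if a raw LES_obs in [0, 1] reaches the task's target.

    Rounds to one decimal before comparing (an assumption of this
    sketch) so that float noise does not decide a borderline case.
    """
    if task_id not in TARGETS:
        raise KeyError(f"no target recorded for {task_id}")
    display = round(les_obs * 100, 1)
    return display >= TARGETS[task_id]
```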

⚡ Score in 2 minutes

pip install "le-loopforge>=0.5.0" "le-loopctl>=0.5.0" loopbench loopgym

loopbench suite list

loopbench run \
  --suite suite-repair \
  --spec submissions/examples/spec-fast-loop.yaml \
  --seeds 0,1,2,3,4 \
  -o results.json

loopbench validate results.json
loopbench rank results.json

Submit to the leaderboard: open a PR adding your entry to leaderboard/entries.json.

v0.2 accepts SimEnv submissions (fully reproducible, no API keys). LiveEnv tier is optional.

📝 Metrics explained

Metric	Meaning
Success@k	Fraction of instances reaching goal threshold
LES_obs	Observed composite ∈ `[0, 1]` — eight categories
Grand composite	Mean of 4 suite scores — generalist rank
Cost	Estimated USD from LSS cost limits
Robustness	Quality retention across seeds

The 0–100 display scale is optional (LES_obs × 100). The leaderboard figures and BEAT targets in this README are shown on that scale.
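
Two of the metrics above reduce to simple arithmetic. The sketch below computes a success fraction against a goal threshold and maps LES_obs to the display scale. The function names are hypothetical, the role of k in Success@k is not spelled out in this README (the sketch takes one final score per instance as given), and rounding to one decimal is an assumption.

```python
def success_rate(instance_scores, goal_threshold):
    """Fraction of instances whose score reaches the goal threshold."""
    if not instance_scores:
        raise ValueError("need at least one instance")
    hits = sum(1 for s in instance_scores if s >= goal_threshold)
    return hits / len(instance_scores)


def to_display_scale(les_obs):
    """Map LES_obs from [0, 1] to the optional 0-100 display scale."""
    if not 0.0 <= les_obs <= 1.0:
        raise ValueError("LES_obs is expected in [0, 1]")
    return round(les_obs * 100, 1)


if __name__ == "__main__":
    # Made-up per-instance scores for five seeds.
    scores = [0.9, 0.4, 0.8, 0.7, 0.2]
    print(success_rate(scores, goal_threshold=0.7))
    print(to_display_scale(0.5))
```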

🎯 Who this is for

You are…	LoopBench gives you…
Loop designer	A number you can improve release-over-release
Framework author	A neutral arena — not your own benchmark
Researcher	Reproducible tasks + published submission schema
Team lead	Comparable scores across designs and vendors

📝 Citation

@software{loopbench2026,
  title={LoopBench: Benchmark Suite for Loop Engineering},
  author={Malpani, Kanak},
  year={2026},
  url={https://pypi.org/project/loopbench/}
}

MIT · v0.2.0 · Contributing
