
KanakMalpani/LoopBench


LoopBench

The public scoreboard for loop engineering.

Fixed tasks. Fixed seeds. Observed LES. Submissions anyone can audit.

No hand-waved demos — bring an LSS spec, get a number, climb the leaderboard.




pip install "le-loopforge>=0.5.0" "le-loopctl>=0.5.0" loopbench loopgym
loopbench list
loopbench suite list

Run your first score · Live leaderboard · Loop Playground · Leaderboard JSON · Suite overview


🚀 What LoopBench measures

You submit a loop specification (LSS YAML). LoopBench:

  1. Runs it through LoopGym on fixed task instances
  2. Computes Success@k and LES_obs across eight categories
  3. Validates your results.json against a published schema
  4. Ranks you on the public leaderboard; generalist (grand composite) is the primary rank

loopbench run --suite suite-repair --spec your-loop.yaml --seeds 0,1,2,3,4 -o results.json
loopbench validate results.json
loopbench rank leaderboard/entries.json
loopbench rank leaderboard/entries.json --suite suite-repair

Suite coverage

One LSS spec → four suite scores → one generalist rank on the public leaderboard. No hand-waved demos. No private benchmarks. Just numbers anyone can reproduce.

LoopBench suite and task coverage

| What LoopBench gives you | Why it beats "we tried it internally" |
| --- | --- |
| Fixed tasks & seeds | Same inputs for every submission: apples to apples |
| LES_obs across 8 dimensions | Speed, cost, robustness, safety; not just pass/fail |
| Public leaderboard | Reputation you can link in a PR |
| Schema-valid results.json | No moving goalposts after the fact |
| Generalist rank | One number that rewards loops that work everywhere |

| Suite | Tasks | Stress area |
| --- | --- | --- |
| suite-repair | 5 | verify-driven repair + safety |
| suite-agent | 5 | multi-agent coordination |
| suite-knowledge | 4 | research + RAG |
| suite-rigor | 5 | composition + HITL + memory |

⚡ The measurement stack

flowchart LR
  YOU["Your LSS spec"]
  LB["LoopBench<br/>tasks · scoring · conformance"]
  LG["LoopGym<br/>SimEnv execution"]
  OUT["results.json → leaderboard"]

  YOU --> LB
  LB --> LG
  LG --> LB
  LB --> OUT
| Layer | Owns | Repo |
| --- | --- | --- |
| Spec | LSS schema, LES formulas | Loop Core Engineering |
| Data | Trajectories (holdout v0.2) | LoopNet |
| Runtime | env.run_episode() | LoopGym |
| Observability | LTF traces, iteration metrics | loop-observability |
| Measurement | Tasks, LES_obs, anti-gaming | LoopBench |

LoopBench defines and scores. LoopGym runs. Never the other way around.

New to the stack? Start with the LoopNet end-to-end tutorial.


🧩 Suites and tasks (v0.2)

19 micro-tasks feed 4 comparison suites. Primary leaderboard rank = generalist (mean of suite scores).

| Suite ID | Label | Micro-tasks |
| --- | --- | --- |
| suite-repair | Repair & Verify | LB-CR-1, LB-REACT-1, LB-REFLEX-1, LB-OPT-1, LB-SAFE-1 |
| suite-agent | Multi-Agent | LB-MA-1, LB-CREW-1, LB-GRAPH-1, LB-TOT-1, LB-VOTE-1 |
| suite-knowledge | Research & RAG | LB-RS-1, LB-RAG-1, LB-BOOT-1, LB-AUTO-1 |
| suite-rigor | Composition & Safety | LB-COMP-1, LB-NEST-1, LB-SIM-1, LB-HITL-1, LB-MEM-1 |

loopbench suite list
loopbench run --suite suite-repair --spec your-loop.yaml --seeds 0,1,2,3,4 -o results.json
loopbench run --task LB-CR-1 --spec your-loop.yaml --seeds 0,1,2,3,4 -o results.json

Full catalog in tasks/index.yaml and SUITE-OVERVIEW.md.
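
A generalist rank needs a score from each of the four suites, so one way to organize a full run is to generate one `loopbench run` invocation per suite. The sketch below only builds the command lines and uses only the flags shown above (`--suite`, `--spec`, `--seeds`, `-o`). The helper names and the `results-<suite>.json` output naming are assumptions made for illustration, not part of LoopBench.

```python
import shlex
from typing import List, Sequence

# Suite IDs copied from the suite table above.
SUITES = ["suite-repair", "suite-agent", "suite-knowledge", "suite-rigor"]


def build_run_command(suite: str, spec: str, seeds: Sequence[int], out: str) -> List[str]:
    """Build the argv for one `loopbench run` call (hypothetical helper)."""
    return [
        "loopbench", "run",
        "--suite", suite,
        "--spec", spec,
        "--seeds", ",".join(str(s) for s in seeds),
        "-o", out,
    ]


def build_all_commands(spec: str, seeds: Sequence[int]) -> List[List[str]]:
    """One command per suite; the output file naming is an assumption."""
    return [
        build_run_command(suite, spec, seeds, f"results-{suite}.json")
        for suite in SUITES
    ]


if __name__ == "__main__":
    # Print the commands instead of executing them, so this runs without loopbench installed.
    for argv in build_all_commands("your-loop.yaml", range(5)):
        print(shlex.join(argv))
```

To actually execute a command, each argv list could be passed to `subprocess.run(argv, check=True)` once the packages are installed.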


📈 Live leaderboard

Live board (updated 2026-06-30) — full rankings

Generalist:

  • Loop Engineering maintainer — LES 86.7
  • Loop Engineering maintainer (MA-1) — LES 86.5
  • Team Thorough — LES 86.4

Submit your loop →


📈 Validate and reproduce

After working through REPRODUCE.md, post your 60-minute reproduction report on the reproduction challenge.

Beat maintainer LES (good-first #4)

One command: BEAT_LB-CR-1.md. Target: LES_obs ≥ 86.7 (0–100 display scale) on LB-CR-1.

Also: BEAT_LB-RS-1.md (81.9) · BEAT_LB-MA-1.md (86.5) · BEAT_LB-COMP-1.md (80.3)

pip install "le-loopforge>=0.5.0" "le-loopctl>=0.5.0" "loopbench>=0.2.0" "loopgym>=0.1.3"
# see BEAT_LB-CR-1.md for full clone + run + submit

⚡ Score in 2 minutes

pip install "le-loopforge>=0.5.0" "le-loopctl>=0.5.0" loopbench loopgym

loopbench suite list

loopbench run \
  --suite suite-repair \
  --spec submissions/examples/spec-fast-loop.yaml \
  --seeds 0,1,2,3,4 \
  -o results.json

loopbench validate results.json
loopbench rank results.json

Submit to the leaderboard: open a PR adding your entry to leaderboard/entries.json.

v0.2 accepts SimEnv submissions (fully reproducible, no API keys). LiveEnv tier is optional.


📝 Metrics explained

| Metric | Meaning |
| --- | --- |
| Success@k | Fraction of instances reaching goal threshold |
| LES_obs | Observed composite ∈ [0, 1] across eight categories |
| Grand composite | Mean of the 4 suite scores; determines generalist rank |
| Cost | Estimated USD from LSS cost limits |
| Robustness | Quality retention across seeds |

Display scale 0–100 is optional (les × 100).
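
The arithmetic behind three of these definitions can be sketched as follows. This is an illustration under stated assumptions, not LoopBench's scoring code: the function names are hypothetical, "reaching" the threshold is read as `>=`, the role of `k` in Success@k is not specified in this README and is ignored here, and all input numbers are made up.

```python
from statistics import mean
from typing import Mapping, Sequence


def success_fraction(scores: Sequence[float], goal_threshold: float) -> float:
    """Fraction of instances whose score reaches the goal threshold (assumed >=)."""
    if not scores:
        raise ValueError("need at least one instance")
    return sum(s >= goal_threshold for s in scores) / len(scores)


def grand_composite(suite_scores: Mapping[str, float]) -> float:
    """Mean of the per-suite scores, each expected in [0, 1]."""
    return mean(suite_scores.values())


def to_display(les: float) -> float:
    """Optional 0-100 display scale: les x 100."""
    return les * 100


if __name__ == "__main__":
    # Made-up per-suite scores, for illustration only.
    suites = {
        "suite-repair": 0.5,
        "suite-agent": 0.75,
        "suite-knowledge": 1.0,
        "suite-rigor": 0.25,
    }
    composite = grand_composite(suites)
    print(composite, to_display(composite))
    print(success_fraction([0.9, 0.4, 0.7, 0.2], goal_threshold=0.5))
```

With this reading, a leaderboard figure such as 86.7 corresponds to a composite of 0.867 on the underlying [0, 1] scale.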


🎯 Who this is for

| You are… | LoopBench gives you… |
| --- | --- |
| Loop designer | A number you can improve release-over-release |
| Framework author | A neutral arena, not your own benchmark |
| Researcher | Reproducible tasks + published submission schema |
| Team lead | Comparable scores across designs and vendors |

📝 Citation

@software{loopbench2026,
  title={LoopBench: Benchmark Suite for Loop Engineering},
  author={Malpani, Kanak},
  year={2026},
  url={https://pypi.org/project/loopbench/}
}

MIT · v0.2.0 · Contributing

About

MLPerf-style benchmark suite for Loop Engineering
