"Good news, everyone!"
Farnsworth Loop is a Claude Code plugin that runs best-of-N tournaments. You hand it a task; it produces N independent attempts in parallel, then a blind Anthropic Opus reviewer scores them, lists pros and cons, ranks them, and names a winner. The attempts can come from any mix of providers (Anthropic, GLM, on-device MLX, OpenAI Codex, MiniMax); the judge is always Opus, held fixed so the comparison stays honest.
@@FL:5 Build a CLI that flattens nested JSON to dotted keys.
That one line triggers the loop: it asks which model(s) to run the 5 attempts on, you answer, it fans out 5 isolated workers, and a blind Opus reviewer crowns a winner. Add :2 for two passes (a guided second round), or a :Z for grand loops (an unattended chain that implements each winner into a real branch and opens a PR).
- The core idea
- Single pass vs two pass
- Invoking it: the sigil and prose forms
- Model providers
- Diversity injection
- Grand loops (Z >= 2)
- The dogfood backlog
- Installation & setup
- The benchmarking system (fl-bench)
- Repository layout
- Honest limitations
A single LLM attempt at a task is one sample from a noisy distribution. The Farnsworth Loop spends tokens to do better than one sample, in two specific ways:
- Generate, don't iterate. Run N attempts in parallel, each a single-pass exploration: every attempt writes its solution once and stops. No attempt is told it's competing or being judged; none sees another's work. The refinement happens at the tournament level (many diverse one-shots → review), never inside a single attempt grinding "until it works." A rough or even failed attempt is useful signal, not a wasted slot.
- Judge blind, with a fixed strong judge. One Anthropic Opus reviewer receives the deliverables labelled Candidate A, B, C, … with no model identities attached. It reads (and where feasible runs) each one, scores against task-appropriate criteria, lists concrete pros and cons, ranks them, and names a winner with reasoning. Because the judge never learns which model produced which candidate, a cheap model can win on merit, and the engine takes mechanical steps (below) to keep that blindness real.
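The blind labelling described above can be sketched as a shuffle followed by letter assignment. This is a minimal illustration, not the engine's code: `blindLabel`, the `attempts` shape, and the returned `pool`/`mapping` objects are hypothetical names, and the sketch assumes at most 26 candidates.

```javascript
// Hypothetical sketch: hand the judge letters only, keep the unblinding map apart.
// `attempts` is assumed to look like [{ model: "opus", deliverable: "..." }, ...].
function blindLabel(attempts, rand = Math.random) {
  if (attempts.length > 26) throw new RangeError("sketch supports at most 26 candidates");

  // Fisher-Yates shuffle of a copy, so the letter is decorrelated from dispatch order.
  const order = attempts.slice();
  for (let i = order.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [order[i], order[j]] = [order[j], order[i]];
  }

  const pool = [];     // what the judge sees: label + deliverable, no model identity
  const mapping = {};  // letter -> model, stored where the judge is not meant to look
  order.forEach((attempt, idx) => {
    const letter = String.fromCharCode(65 + idx); // A, B, C, ...
    pool.push({ label: `Candidate ${letter}`, deliverable: attempt.deliverable });
    mapping[letter] = attempt.model;
  });
  return { pool, mapping };
}
```

Keeping `mapping` out of the object handed to the judge is the point of the design: a ranking can only be unblinded afterwards by whoever holds the map.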
The attempts are deliberately diverse: different model families, sampling stochasticity, and a per-attempt framing nudge (diversity injection) all push the N solutions apart so the review has genuinely different things to compare.
The two modes share one spine. Two pass is single pass plus a learning step in the middle.
| Phase | Single pass | Two pass |
|---|---|---|
| Round 1 attempts | N parallel, isolated, one diversity nudge each | identical |
| Blind Opus review | scores, ranks, names winner → this is the result | scores, ranks, names round-1 winner and distils guidance |
| Carry / discard | — | save the winner's deliverable; discard every other artifact, keep only the distilled lessons |
| Round 2 attempts | — | N fresh attempts given the task + guidance (positives to emulate, pitfalls to avoid) — but never round-1 code |
| Final rank | — | pool = N round-2 attempts + the saved round-1 winner (N+1), re-labelled blind, one Opus ranker picks overall winner |
The distilled guidance is two short lists — positives to consider and challenges to avoid — phrased as generic principles, each tagged [strong] (held up repeatedly) or [tentative] (a single sighting), with no candidate-specific code.
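To make the shape of that guidance concrete, here is a small sketch of the two lists and the tagging rule. In the real loop the distillation is done by the Opus reviewer, not by counting; `distil`, the `observations` shape, and the "seen at least twice means [strong]" threshold are assumptions made for illustration.

```javascript
// Hypothetical sketch of the guidance shape: two lists of tagged generic principles.
// `observations` is assumed to be [{ kind: "positive" | "pitfall", principle: "..." }, ...].
function distil(observations) {
  // Count how often each (kind, principle) pair was sighted across candidates.
  const counts = new Map();
  for (const { kind, principle } of observations) {
    const key = JSON.stringify([kind, principle]);
    counts.set(key, (counts.get(key) || 0) + 1);
  }

  const guidance = { positives: [], pitfalls: [] };
  for (const [key, sightings] of counts) {
    const [kind, principle] = JSON.parse(key);
    // Assumption: "held up repeatedly" is modelled as two or more sightings.
    const tag = sightings >= 2 ? "[strong]" : "[tentative]";
    const list = kind === "positive" ? guidance.positives : guidance.pitfalls;
    list.push(`${tag} ${principle}`);
  }
  return guidance;
}
```

Note that the output carries principles only; nothing in it can hold candidate code, which matches the rule that round 2 never sees round-1 code.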
Why two pass discards the losing code but keeps the lessons: re-using a winner's code would make round 2 copy it and collapse the diversity that makes the loop work. Re-using the distilled pros and cons keeps diversity while raising the floor. The saved round-1 champion then competes blind in the final pool on the merits — it produced no worse work for not having seen the guidance, and if a guided round-2 attempt is genuinely better, it should win.
SINGLE PASS
task ──▶ [N attempts] ──▶ blind Opus review ──▶ winner ✓

TWO PASS
task ──▶ [N attempts] ──▶ blind Opus review ──┬─▶ distil guidance ─┐
                                              └─▶ save winner ──┐  │
                                                                │  ▼
          [N fresh attempts + task + guidance] ◀────────────────┼──┘
                            │                                   │
                            ▼                                   │
          final pool = N round-2 + saved winner ◀───────────────┘
                            │
                            ▼
          blind Opus final rank ──▶ overall winner ✓
Two pass roughly doubles the attempt count (≈ 2N attempts + 2 Opus passes), so it costs about double single pass. The skill confirms volume before spending at N ≥ 8 (single pass) or N ≥ 6 (two pass).
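The arithmetic above is simple enough to write down. The sketch below only restates the figures in this section (N attempts + 1 Opus pass, or 2N + 2, and the N ≥ 8 / N ≥ 6 confirmation thresholds); `estimateCost` and its return shape are hypothetical.

```javascript
// Hypothetical helper: attempt and Opus-pass counts for one tournament.
function estimateCost(n, passes) {
  if (!Number.isInteger(n) || n < 2) throw new RangeError("N must be an integer >= 2");
  if (passes !== 1 && passes !== 2) throw new RangeError("passes must be 1 or 2");

  const attempts = n * passes;             // N for single pass, 2N for two pass
  const opusPasses = passes;               // one review, plus a final rank in two pass
  const confirmAt = passes === 1 ? 8 : 6;  // thresholds quoted in the text
  return { attempts, opusPasses, needsConfirm: n >= confirmAt };
}

console.log(estimateCost(5, 2)); // 10 attempts, 2 Opus passes, no confirmation needed
```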
Put a trigger anywhere in your message; the rest of the message is the task. All forms are case-insensitive with optional spaces around the colons. Phase 0 of the skill runs a bundled parser (bin/fl-parse.mjs) on your raw message (it does not hand-parse the sigil) and acts on the JSON it returns.
Write the trigger literally, unquoted: @@FL:5 is correct; don't wrap it in shell quotes.
@@FL[:N][:M[:Z]]
| Field | Meaning | Default | Constraints |
|---|---|---|---|
| N | attempts per round | none (asks, or inferred) | integer ≥ 2 |
| M | passes | 1 (single pass) | 1 or 2; any other value is an error |
| Z | grand loops | 1 (isolated tournament) | integer 1..5; Z > 5 is refused (split into batches) |
- @@FL:5 → 5 attempts, single pass.
- @@FL:5:2 → 5 attempts, two pass.
- @@FL:5:2:3 → 5 attempts, two pass, 3 grand loops.
- @@FL (bare) → falls back to the interactive model gate (it asks N and the model).
Positional skips are forbidden. @@FL:5::3 is invalid — to set Z with a default M, write @@FL:5:1:3. (Z=1 and omitting Z are byte-identical: today's isolated tournament.)
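The grammar and constraints above can be sketched as a small parser. This is not bin/fl-parse.mjs: `parseSigil` and its return shape are hypothetical, and the sketch handles the sigil only (no prose marker, no fleet spec, no Top Mixed).

```javascript
// Hypothetical sketch of the @@FL[:N][:M[:Z]] rules described above.
function parseSigil(message) {
  // Case-insensitive, optional spaces around colons; each field is a run of digits.
  const match = /@@FL((?:\s*:\s*\d*)*)/i.exec(message);
  if (!match) return { found: false };

  const fields = match[1].split(":").slice(1).map((f) => f.trim());
  if (fields.length === 0) return { found: true, needsGate: true }; // bare @@FL
  if (fields.length > 3) return { found: true, error: "too many fields" };
  if (fields.includes("")) return { found: true, error: "positional skips are forbidden" };

  // Defaults from the table: M = 1 (single pass), Z = 1 (isolated tournament).
  const [n, passes = 1, grandLoops = 1] = fields.map(Number);
  if (n < 2) return { found: true, error: "N must be >= 2" };
  if (passes !== 1 && passes !== 2) return { found: true, error: "M must be 1 or 2" };
  if (grandLoops < 1) return { found: true, error: "Z must be >= 1" };
  if (grandLoops > 5) {
    return { found: true, error: `Z=${grandLoops} exceeds 5; split into batches` };
  }
  return { found: true, n, passes, grandLoops };
}

console.log(parseSigil("@@FL:5:2:3 Refactor the auth module."));
```

Returning an `error` field rather than guessing mirrors the stated behaviour that invalid input is refused instead of silently defaulted.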
There is also a prose marker that extends identically: farnsworth loop:N[:M[:Z]], e.g. do abc: farnsworth loop:5 (single pass) or do abc: farnsworth loop:5:2 (two pass).
You can describe the fleet in prose instead of giving an explicit N. A comma- or and-separated list of <count> <model> items anywhere in the message becomes the per-attempt assignment; the sum of the counts is N, and the spec text is stripped from the task:
@@FL two passes, 4 opus, 2 sonnet, 2 codex high, 2 glm 4.7, 2 minimax Refactor the auth module.
Counts: 4 + 2 + 2 + 2 + 2 = 12, so N = 12 with the per-attempt assignment [opus×4, sonnet×2, codex-high×2, glm-4.7×2, minimax-m3×2], run as two pass (the phrase "two passes" sets M=2). Because the spec already answers "which models," the interactive model menu is skipped.
A spec is recognised as <count> <model> items only; an ordinary <digit> <noun> in the task (e.g. "fix 3 bugs") is not a spec. If a model token isn't recognised the parser stops and asks rather than silently dropping it (a dropped token would change N).
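A rough sketch of that spec recognition follows. It is deliberately narrower than the real parser: `parseFleetSpec` and the `MODEL_ALIASES` table are hypothetical, only a handful of aliases are listed, and the "stop and ask on an unrecognised model token" behaviour is not modelled.

```javascript
// Hypothetical alias table: spoken form -> model id (subset, for illustration only).
const MODEL_ALIASES = [
  ["codex high", "codex-high"],
  ["glm 5.2", "glm-5.2"],
  ["glm 4.7", "glm-4.7"],
  ["minimax", "minimax-m3"],
  ["opus", "opus"],
  ["sonnet", "sonnet"],
];

function parseFleetSpec(message) {
  const escapeRe = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const names = MODEL_ALIASES.map(([alias]) => escapeRe(alias)).join("|");
  // Only "<count> <known model>" counts as a spec item, so "fix 3 bugs" is ignored.
  const itemRe = new RegExp(`\\b(\\d+)\\s+(${names})\\b`, "gi");

  const assignment = [];
  for (const [, count, alias] of message.matchAll(itemRe)) {
    const id = MODEL_ALIASES.find(([a]) => a === alias.toLowerCase())[1];
    for (let i = 0; i < Number(count); i++) assignment.push(id);
  }
  // The phrase "two passes" selects two pass mode; otherwise single pass.
  const passes = /\btwo\s+pass(es)?\b/i.test(message) ? 2 : 1;
  return { n: assignment.length, passes, assignment }; // N is the sum of the counts
}

const spec = parseFleetSpec(
  "@@FL two passes, 4 opus, 2 sonnet, 2 codex high, 2 glm 4.7, 2 minimax Refactor the auth module."
);
console.log(spec.n, spec.passes); // 12 2
```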
- Mixed (Specify Mix): choose a concrete model for each attempt, one at a time. A prose spec is a Mixed assignment.
- Top Mixed: the keyword top mixed (also top-mix / top mix) plus an N spreads N as evenly as possible across [opus, glm-5.2, codex-high], remainder priority opus > glm-5.2 > codex-high (and N=2 → [opus, glm-5.2]). N can come from the sigil or a leading count.
@@FL:6 top mixed Design a rate-limiter.
N = 6 over three buckets → 2 each → [opus, opus, glm-5.2, glm-5.2, codex-high, codex-high].
@@FL:5 top mixed ...
N = 5 → base 1 each (3), remainder 2 by priority → opus +1, glm-5.2 +1 → [opus, opus, glm-5.2, glm-5.2, codex-high].
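The two worked examples follow from one rule: integer-divide N across the three buckets and hand the remainder out in priority order. A minimal sketch, with `topMixed` as a hypothetical name:

```javascript
// Hypothetical sketch of the Top Mixed spread: even split, remainder by priority.
function topMixed(n) {
  const buckets = ["opus", "glm-5.2", "codex-high"]; // also the remainder priority order
  if (!Number.isInteger(n) || n < 2) throw new RangeError("N must be an integer >= 2");

  const base = Math.floor(n / buckets.length);
  const remainder = n % buckets.length;
  // The first `remainder` buckets each get one extra attempt.
  return buckets.flatMap((model, i) => Array(base + (i < remainder ? 1 : 0)).fill(model));
}

console.log(topMixed(5)); // [ 'opus', 'opus', 'glm-5.2', 'glm-5.2', 'codex-high' ]
```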
| Invocation | N | Mode | Z | Assignment / behaviour |
|---|---|---|---|---|
| @@FL:4 | 4 | single | 1 | asks the model gate, then 4 attempts |
| @@FL:5:2 | 5 | two | 1 | 5 attempts/round, guided round 2, final rank |
| @@FL | gate | gate | 1 | bare → interactive gate (N defaults to 6, passes to 2 in the gate) |
| @@FL:6 top mixed | 6 | single | 1 | [opus, opus, glm-5.2, glm-5.2, codex-high, codex-high] |
| @@FL 2 opus, 2 glm 5.2, 1 codex high | 5 | single | 1 | [opus, opus, glm-5.2, glm-5.2, codex-high] (N inferred from the spec) |
| @@FL two passes, 4 opus, 2 sonnet, 2 codex high, 2 glm 4.7, 2 minimax | 12 | two | 1 | [opus×4, sonnet×2, codex-high×2, glm-4.7×2, minimax-m3×2] |
| @@FL:5:2:3 | 5 | two | 3 | grand-loop chain (authorization + per-loop PR) |
| farnsworth loop:7:2 | 7 | two | 1 | prose marker, same as @@FL:7:2 |
Note the two different defaults. When @@FL has no N and no spec, the parser returns needsGate and control passes to the interactive gate, where N defaults to 6 and passes to 2. That gate default is not the grammar default: an explicit @@FL:5 (no M) is single pass, because the sigil's M defaults to 1. Don't conflate the gate's "passes default to 2" with the sigil's "M defaults to 1."
If you describe a generate-and-rank tournament in plain English with no marker at all, the skill can still infer single vs two pass and ask for N and the model.
Attempts can run on five providers. The reviewer and final ranker are always Anthropic Opus, dispatched via the Task tool — never anything else — so scoring stays consistent across attempts and rounds. Each non-Anthropic provider runs through a bundled runner script invoked by a thin command-runner agent (see layout); this indirection is what makes those paths reliable.
| Provider | Models (selectable axis) | Dispatch | Auth (from the environment) |
|---|---|---|---|
| Anthropic | opus · sonnet · haiku | Task tool, native in-process | the session's own Claude auth (no extra key) |
| GLM (z.ai) | glm-5.2 · glm-5.1 · glm-4.7 · glm-4.5-air | claude pointed at z.ai via bin/glm-run.sh | ZAI_API_KEY |
| Local MLX | dynamic on-device list (via the omlx server) | claude pointed at http://127.0.0.1:8000 via bin/local-run.sh | OMLX_AUTH_TOKEN |
| Codex (OpenAI) | gpt-5.5, axis = reasoning effort: codex-low/medium/high/xhigh | codex exec via bin/codex-run.sh | ~/.codex/auth.json (no env var) |
| MiniMax | MiniMax-M3 (the only model) | claude pointed at the MiniMax endpoint via bin/minimax-run.sh | MINIMAX_API_KEY |
A few provider specifics worth knowing:
- Anthropic model aliases map to API strings claude-opus-4-8 / claude-sonnet-4-6 / claude-haiku-4-5.
- GLM is the claude CLI pointed at z.ai's Anthropic-compatible endpoint. The selection maps through glm's --model flag, which is not the GLM name: glm-5.2 → --model opus, glm-5.1 → --model glm-5.1, glm-4.7 → --model sonnet, glm-4.5-air → --model haiku. (Those aliases resolve to GLM models because the wrapper sets ANTHROPIC_DEFAULT_{OPUS,SONNET,HAIKU}_MODEL.) GLM bills the z.ai plan and is the slow one on large multi-file tasks, so give it a generous glmTimeoutSecs.
- Local MLX has a dynamic catalogue: fetch it live with omlx-models (or curl -s http://127.0.0.1:8000/v1/models -H "Authorization: Bearer $OMLX_AUTH_TOKEN" | jq -r '.data[].id'). Ids pass straight through as --model <exact-id>. Local models are free (on-device) but slower, and small ones can be unreliable at saving a deliverable; prefer hosted providers for heavy writing tasks.
- Codex is pinned to gpt-5.5, the only model the local ChatGPT-account auth serves (other ids return HTTP 400 unless you set OPENAI_API_KEY for API-key billing). So the lever is reasoning effort: low | medium | high | xhigh ("Extra high"; minimal is rejected). Codex has no turn cap, so its only per-attempt backstop is the wall clock (codexTimeoutSecs, default 600s). It bills your OpenAI/ChatGPT plan.
- MiniMax exposes one model, MiniMax-M3 (512K context), pinned via ANTHROPIC_MODEL=MiniMax-M3 (so there's no --model flag). It bills your MiniMax plan; M3 was fast and clean on a heavy multi-file build in testing.
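The provider specifics above amount to a lookup from a selection id to dispatch parameters. The sketch below encodes only the mappings stated in this section; `resolveDispatch`, the `DISPATCH` table, the field names, and the `local:<id>` prefix convention (borrowed from the fl-bench selection grammar) are illustrative assumptions, and how the Codex effort is passed to the CLI is deliberately left out.

```javascript
// Hypothetical lookup table built from the mappings listed above.
const DISPATCH = new Map([
  ["opus", { provider: "anthropic", model: "claude-opus-4-8" }],
  ["sonnet", { provider: "anthropic", model: "claude-sonnet-4-6" }],
  ["haiku", { provider: "anthropic", model: "claude-haiku-4-5" }],
  // GLM: the --model value is an alias, not the GLM name.
  ["glm-5.2", { provider: "glm", modelFlag: "opus" }],
  ["glm-5.1", { provider: "glm", modelFlag: "glm-5.1" }],
  ["glm-4.7", { provider: "glm", modelFlag: "sonnet" }],
  ["glm-4.5-air", { provider: "glm", modelFlag: "haiku" }],
  // MiniMax: pinned through the environment, so there is no --model flag.
  ["minimax-m3", { provider: "minimax", env: { ANTHROPIC_MODEL: "MiniMax-M3" } }],
]);
const CODEX_EFFORTS = ["low", "medium", "high", "xhigh"]; // "minimal" is rejected

function resolveDispatch(selection) {
  if (DISPATCH.has(selection)) return DISPATCH.get(selection);

  const codex = /^codex-(\w+)$/.exec(selection);
  if (codex) {
    if (!CODEX_EFFORTS.includes(codex[1])) throw new RangeError(`unsupported effort: ${codex[1]}`);
    return { provider: "codex", model: "gpt-5.5", effort: codex[1] };
  }
  // Assumed convention: local ids are written local:<exact-id> and pass straight through.
  if (selection.startsWith("local:")) {
    return { provider: "local", modelFlag: selection.slice("local:".length) };
  }
  throw new RangeError(`unrecognised model: ${selection}`);
}
```

Throwing on an unknown id, rather than falling back to some default model, follows the document's stance that a missing provider should fail honestly.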
Every runner reads its key from the environment — never by sourcing or grepping rc files — so the providers stay uniform. A provider whose key is unset produces an honest failure, not a fake fallback.
Provenance check (the anti-spoofing guard). Every non-Anthropic attempt writes a marker into its run log proving it actually hit the intended endpoint — FARNSWORTH-GLM-PROVENANCE endpoint=api.z.ai, FARNSWORTH-LOCAL-PROVENANCE endpoint=127.0.0.1:8000, FARNSWORTH-CODEX-PROVENANCE endpoint=api.openai.com, FARNSWORTH-MINIMAX-PROVENANCE endpoint=api.minimax.io — plus a DONE exit=0 with no TIMEOUT/ERROR. The validator is line-anchored and provider-specific (^FARNSWORTH-<PROV>-…), so an attempt whose own deliverable merely mentions a marker token can't false-fail. A candidate with no marker or no saved file is treated as a failed attempt and excluded; the round proceeds over the survivors.
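A line-anchored, provider-specific check of this kind might look like the sketch below. The marker and endpoint strings are the ones quoted above; `validateRunLog`, the `ENDPOINTS` table, and the assumption that TIMEOUT and ERROR appear at the start of a log line are illustrative, not taken from the engine.

```javascript
// Endpoints as quoted in the text, keyed by the provider token used in the marker.
const ENDPOINTS = {
  GLM: "api.z.ai",
  LOCAL: "127.0.0.1:8000",
  CODEX: "api.openai.com",
  MINIMAX: "api.minimax.io",
};

// Hypothetical validator: every pattern is anchored to the start of a line ("m" flag),
// so a token that merely appears inside deliverable prose does not count.
function validateRunLog(provider, logText) {
  const endpoint = ENDPOINTS[provider];
  if (!endpoint) return { ok: false, reason: "unknown provider" };

  const esc = endpoint.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const marker = new RegExp(`^FARNSWORTH-${provider}-PROVENANCE endpoint=${esc}(?=\\s|$)`, "m");
  if (!marker.test(logText)) return { ok: false, reason: "missing provenance marker" };
  if (!/^DONE exit=0(?=\s|$)/m.test(logText)) return { ok: false, reason: "no clean DONE" };
  // Assumption: failures are logged as lines beginning with TIMEOUT or ERROR.
  if (/^(?:TIMEOUT|ERROR)\b/m.test(logText)) return { ok: false, reason: "timeout or error" };
  return { ok: true };
}
```

A candidate that fails this check would simply be dropped from the pool, and the round would continue over the remaining candidates.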
Independent attempts only help if they actually differ; same-model siblings on an identical prompt tend to converge. Diversity injection (default on) gives each attempt a distinct framing drawn from a modifier pool, so any divergence is attributable to the modifier. The draw is seeded and logged for reproducibility, without replacement within a round, and biased so same-model siblings get the most-different nudges.
- Pool A: approach nudges (default on, blind-safe). These vary how an attempt starts and proceeds, not what counts as a good answer (e.g. "from first principles," "test-first," "happy path first then harden," or for prose "lead with your strongest claim," "write for a smart, skeptical reader"). Because they don't move the success criteria, blind review is preserved. Pool A is task-type-aware: a light heuristic picks a code-flavoured or prose-flavoured set so a "data model" nudge isn't prepended to an essay.
- Pool B: objective lenses (opt-in only). These deliberately bias the tradeoff (safely, quickly, efficiently, robustly, minimally, …). Useful to fan attempts across a tradeoff frontier on purpose, but an attempt told "quickly" may correctly produce something fast and thin that a completeness-minded blind reviewer marks down. Pick one honest handling and state it: best-overall (keep review blind; the lens is exploration spice) or judge-to-intent (pass the lens to the reviewer, breaking blindness). If you didn't opt in, only Pool A is used.
In two pass, round 2 takes a fresh Pool A draw only — the distilled guidance already carries the objective steering, so a conflicting Pool B lens would send mixed signals.
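A seeded draw without replacement can be sketched as a seeded shuffle followed by taking the first N entries. Everything named here is hypothetical: `drawNudges`, the choice of the mulberry32 generator, and the requirement that the pool holds at least N modifiers. The bias that gives same-model siblings the most-different nudges is not modelled.

```javascript
// Small deterministic PRNG (mulberry32): the same seed always yields the same sequence.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) | 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // uniform in [0, 1)
  };
}

// Hypothetical draw: one distinct modifier per attempt, reproducible from the logged seed.
function drawNudges(pool, n, seed) {
  if (n > pool.length) throw new RangeError("pool is smaller than N"); // sketch assumption
  const rand = mulberry32(seed);
  const deck = pool.slice();
  for (let i = deck.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [deck[i], deck[j]] = [deck[j], deck[i]];
  }
  return deck.slice(0, n); // without replacement: each modifier appears at most once
}

const poolA = ["from first principles", "test-first", "happy path first then harden"];
console.log(drawNudges(poolA, 2, 1234));
```

Logging the seed alongside the draw is what makes a round replayable: the same pool, N, and seed reproduce the same framing for every attempt.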
Z=1 (or omitted) is the isolated tournament: it touches no repo and opens no PR. Z >= 2 turns the tournament into an unattended chain that, for each of Z loops, runs a full tournament, implements the winning proposal into your real repository on a new branch, verifies it, and opens one PR. It never auto-merges.
Per loop k (FAN topology — the default):
- STOP-file kill switch checked at the top of every iteration: create <runDir>/STOP at any time to halt before the next loop, without killing the harness.
- Branch off base: FL-<loop>-<random7> off the branch you started on. (This name overrides the global rob/ branch-prefix rule for loop branches only.)
- Run the tournament (the unchanged engine), with the task augmented by a cross-loop ledger of what prior loops already proposed, so each loop attacks something different.
- Implement the winner via the Opus farnsworth-implementer agent, the only actor that writes to the real repo. It makes the smallest coherent change on the branch, leaves it unstaged, records ambiguities in FL-NOTES.md, and never runs git.
- Verify, fail-closed. Auto-detected commands (npm scripts, ruff/pytest, make test/check, cargo, go) run fail-fast; a failure or "no verify commands" routes the PR to draft + needs-human (with a capped tail of the failing output) and halts the chain.
- Commit, push, open the PR (one per loop, individually mergeable off base). A per-loop DONE marker is written only after the PR exists, so a re-run skips completed loops.
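The fail-closed verify step above reduces to a small decision function. This sketch covers only that routing, not the git or PR side effects that bin/fl-git.sh owns; `routeVerify`, the `results` shape, and the 20-line tail cap are assumptions.

```javascript
// Hypothetical routing for the verify step.
// `results` is assumed to be [{ command, exitCode, output }] in run order; because the
// commands run fail-fast, at most the last entry has a non-zero exit code.
function routeVerify(results, tailLines = 20) {
  const needsHuman = (reason, tail) => ({
    draft: true,
    labels: ["needs-human"],
    haltChain: true,
    reason,
    tail,
  });

  // Fail closed: having nothing to run is treated like a failure, never as a pass.
  if (results.length === 0) return needsHuman("no verify commands", "");

  const failed = results.find((r) => r.exitCode !== 0);
  if (failed) {
    const tail = failed.output.split("\n").slice(-tailLines).join("\n"); // capped tail
    return needsHuman(`verify failed: ${failed.command}`, tail);
  }
  return { draft: false, labels: [], haltChain: false, reason: "verified", tail: "" };
}
```

Treating "no verify commands" the same as a failure is what makes the step fail-closed: an unverifiable change never reaches a ready-to-merge PR.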
At the end the driver switches back to your starting branch. Safety rails:
- One front-loaded authorization (Phase 0b) replaces the per-dispatch gate for the whole chain; it is the only way an unattended chain is compatible with "stop and ask." A zero-token preflight runs first (work tree clean? gh authenticated? remote resolves? base resolves?) and refuses on a dirty tree. You must re-type Z to proceed (friction proportional to blast radius). Z_MAX = 5. The parser refuses Z > 5 outright (echoing the offending Z, telling you to split into batches); the re-type guards a valid-but-large Z.
- Non-implementable-task check. A task that produces a standalone artifact unrelated to the repo (a haiku, an email) would open empty PRs, so grand loops offer Z=1 instead.
- STACK topology (each loop off the previous) is opt-in (topology=stack in your message) and forces halt-on-failure; FAN (independent PRs off base) is the default.
- Mid-loop death leaves an orphan FL-<k>-* branch with no DONE marker; on re-entry the driver stops and tells you to inspect/delete it. It never auto-resumes a half-applied step.
The orchestration home is the skill procedure plus bin/fl-git.sh (which owns all git/gh side effects). The tournament engine (workflows/tournament.mjs) is unchanged and never touches the repo — that purity is its safety guarantee.
Problems found while running tournaments are filed as GitHub Issues labelled dogfood (the live backlog), so they survive the gitignored .runs/ directory and get triaged later. All forge access is confined to one helper, bin/fl-issue.sh, so the engine stays forge-agnostic.
bin/fl-issue.sh bootstrap # (once) create the label scheme
bin/fl-issue.sh new --sev sev2 --area parse \
--title "…" --evidence-file EV.md # file an item (dedups first)
bin/fl-issue.sh next # top open item (sev1 → sev3)
bin/fl-issue.sh claim <N> <run-id> # best-effort claim
# fix on a rob/dogfood-<N> branch, open one PR with "Closes #<N>"
- Severity sev1 (wrong winners) · sev2 (degraded but usable) · sev3 (cosmetic/docs); area labels area:review|runner|parse|git|skill|docs|infra.
- Every item needs a verbatim evidence excerpt (the helper refuses empty/placeholder evidence).
- PUBLIC repo: never paste secrets or the mapping.json unblinding line; refer to a candidate as "blind B," not the model (the helper has refusal greps for both).
- Claiming is best-effort, not a mutex: the GitHub API has no compare-and-swap, so claim is a TOCTOU read-after-write with a deterministic tiebreak; a git-ref push under refs/dogfood-claims/ is the strict escape hatch for high fan-out / grand loops.
- No gh / offline? new degrades to a committed draft under docs/dogfood/inbox/ (never .runs/); re-file later with drain-inbox.
- Legacy D-NNNN items live read-only under docs/dogfood/archive/.
Farnsworth Loop is a Claude Code plugin (farnsworth-loop, skills + agents + workflows + bin runners). It is published through the my-skills marketplace (robanderson/claude-my-skills), which references this repo as an external plugin source — so a new user installs it from Claude Code with:
/plugin marketplace add robanderson/claude-my-skills
/plugin install farnsworth-loop@my-skills
Once installed, the @@FL trigger and the farnsworth-loop / farnsworth-bench skills are available in your sessions; the bundled scripts under bin/ are run with node / bash from the resolved plugin root. Confirm the farnsworth-loop skill appears after installing.
Enabling dynamic workflows. The preferred backend runs on Claude Code's dynamic-workflow orchestration. Turn it on by upgrading effort to its maximum: run /effort and select ultracode (max = xhigh reasoning + dynamic workflow orchestration). With it on, the tournament fans out through the Workflow engine and is watchable live in /workflows. Without it, the skill automatically falls back to manual Task-tool + glm CLI dispatch: the same tournament, just not driven by the workflow engine.
Anthropic only is enough to start. Opus/Sonnet/Haiku attempts and the Opus judge use your session's own Claude auth — no extra keys. Everything below is optional, needed only if you want attempts on that provider:
| Provider | You need | Set how |
|---|---|---|
| Anthropic | nothing extra | session auth |
| GLM (z.ai) | the glm CLI + ZAI_API_KEY | export ZAI_API_KEY in your shell profile |
| Local MLX | the omlx server running on 127.0.0.1:8000 + OMLX_AUTH_TOKEN | start omlx; export OMLX_AUTH_TOKEN |
| Codex (OpenAI) | the codex CLI, signed in (~/.codex/auth.json) | codex login; optional OPENAI_API_KEY for non-gpt-5.5 ids |
| MiniMax | the claude CLI + MINIMAX_API_KEY | export MINIMAX_API_KEY in your shell profile |
Every runner reads its key from the environment (exported in your shell profile and inherited into the session at launch), exactly the same way for every provider. The skill probes liveness before spending a round where it can — e.g. a one-line Codex probe, an omlx-models fetch — and offers another tier if a provider is down or its CLI is stale (e.g. brew upgrade codex).
The farnsworth-bench skill (trigger: ask to benchmark model speed, or run /fl-bench) is a thin wrapper over bin/fl-bench.mjs. It measures generation throughput (tokens/second) for every model the system can call, on a cold call (first call — connection/cache/route warmup for hosted providers, a genuine weight-load only for local MLX) and an immediate hot call (second identical call). It prints a table and appends every result to <plugin>/.bench/results.jsonl (append-only, written per-model immediately, so a crashed sweep keeps what it produced).
# every callable model (local MLX list discovered live), cold + hot each:
node "<plugin-root>/bin/fl-bench.mjs" --models all
# a subset:
node "<plugin-root>/bin/fl-bench.mjs" --models anthropic,glm
node "<plugin-root>/bin/fl-bench.mjs" --models glm:glm-5.1,codex:codex-high,opus
# dry-run: print the resolved plan, make NO model calls (cheap, testable):
node "<plugin-root>/bin/fl-bench.mjs" --list --models all
# heavy profile (slower + pricier):
node "<plugin-root>/bin/fl-bench.mjs" --models opus,minimax-m3 --profile heavy
Selection grammar (--models, comma-separated, de-duped): all (default) · a provider (anthropic|glm|local|codex|minimax) · <provider>:<id> (e.g. glm:glm-5.1, codex:codex-high, local:<omlx-id>) · a bare id (opus, glm-5.2, minimax-m3, codex-high, a local id).
Two workload profiles (--profile, default light; --heavy/--light shorthand):
| Profile | Input | Output cap (tokens) | Timeouts (default/local) | Use |
|---|---|---|---|---|
| light (default) | ~200-word paragraph | 2048 | 240s / 600s | fast/cheap throughput smoke |
| heavy | fixed >5k-token code context | 8192 | 600s / 1200s | representative coding/agentic load: drives a >5k-token decode |
The light cap is 2048, not a few hundred: an extended-thinking model rejects a sub-1024 output cap (the thinking-budget floor), which is why both profile caps sit above it. The heavy profile is for when the light profile's small decode is too tiny to characterise real coding throughput.
tok/s = output_tokens / generation_wall_seconds, using the provider's real token counts (claude-family parses claude -p --output-format json --verbose; local uses omlx usage.completion_tokens; codex uses a real usage event if present, else a flagged chars/4 estimate). Auth comes from the environment exactly as the runners do; a provider whose key is unset is recorded as a failed row and the sweep continues. Each result row records the profile name, cold/hot output tokens, the real input size, seconds, and a timestamp.
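As a worked instance of that formula, the sketch below computes throughput from real token counts and falls back to the flagged chars/4 estimate when no count is available. `tokensPerSecond` and its argument names are hypothetical, not fl-bench's internals.

```javascript
// Hypothetical helper: tok/s = output_tokens / generation_wall_seconds.
function tokensPerSecond({ outputTokens, outputChars, seconds }) {
  if (!(seconds > 0)) throw new RangeError("seconds must be positive");

  // Prefer the provider's real token count; otherwise estimate chars/4 and flag it.
  const estimated = outputTokens == null;
  const tokens = estimated ? outputChars / 4 : outputTokens;
  return { tokPerSec: tokens / seconds, estimated };
}

console.log(tokensPerSecond({ outputTokens: 2048, seconds: 16 })); // 128 tok/s, not estimated
console.log(tokensPerSecond({ outputChars: 4000, seconds: 10 }));  // 100 tok/s, estimated
```

Carrying the `estimated` flag through to the result row keeps estimated figures from being compared as if they were measured.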
Useful tips: run --list first to confirm (and price) the plan before the real paid sweep; the heavy all-models sweep is much slower and pricier (slow local models can approach the 1200s local timeout), so warn yourself and consider a representative subset. Full reference: bin/README.fl-bench.md.
A Claude Code plugin: a manifest, two skills, eight agents, the workflow engine, and the bin helpers. How the pieces fit:
farnsworth-loop/
├── plugin.json # plugin manifest (name, version, skills, agents)
├── DOGFOOD.md # pointer stub → live backlog is GitHub Issues
├── skills/
│ ├── farnsworth-loop/
│ │ ├── SKILL.md # the orchestration procedure (Phases 0–7); runs the loop
│ │ └── references/
│ │ ├── orchestration.md # dispatch mechanics, model ids, runner args, run layout
│ │ ├── diversity-injection.md # Pool A / Pool B, sampling rules
│ │ ├── review-rubric.md # the Opus reviewer/ranker scoring + distillation rubric
│ │ └── dogfood.md # the dogfood-backlog convention
│ └── farnsworth-bench/
│ └── SKILL.md # /fl-bench wrapper over bin/fl-bench.mjs
├── agents/ # bundled worker agents
│ ├── farnsworth-glm-5-2.md # GLM workers (one per GLM model) — run bin/glm-run.sh
│ ├── farnsworth-glm-5-1.md
│ ├── farnsworth-glm-4-7.md
│ ├── farnsworth-glm-4-5-air.md
│ ├── farnsworth-local.md # one generic local worker — runs bin/local-run.sh
│ ├── farnsworth-codex.md # one generic codex worker — runs bin/codex-run.sh
│ ├── farnsworth-minimax.md # one generic minimax worker — runs bin/minimax-run.sh
│ └── farnsworth-implementer.md # (grand loops) Opus; applies the winner to the real repo
├── workflows/
│ └── tournament.mjs # the dynamic-workflow engine: attempts → blind review → (two pass) round 2 → final rank
├── bin/ # runners + helpers (run with node / bash)
│ ├── fl-parse.mjs # Phase 0 invocation parser (sigil, prose spec, Top Mixed, Z)
│ ├── fl-git.sh # ALL git/gh side effects for grand loops (preflight, branch, verify, PR, markers)
│ ├── fl-issue.sh # the only forge-touching part: dogfood GitHub-Issues helper
│ ├── glm-run.sh # provider runner scripts (build a benign command, set env,
│ ├── local-run.sh # call the nested claude/codex, write provenance markers)
│ ├── codex-run.sh
│ ├── minimax-run.sh
│ ├── fl-bench.mjs # the throughput benchmark
│ └── README.fl-bench.md # fl-bench usage + results-format reference
├── docs/
│ ├── dogfood/archive/ # read-only legacy D-NNNN evidence
│ ├── dogfood/inbox/ # committed offline drafts (no-gh fallback)
│ └── *.md # design notes (grand loops, persisted outcomes, …)
└── .bench/results.jsonl # append-only fl-bench history
How the pieces fit together at run time:
- SKILL.md is the driver. On a trigger it runs bin/fl-parse.mjs (Phase 0), the mandatory model gate (Phase 1), diversity injection (Phase 1b), then dispatches.
- workflows/tournament.mjs is the preferred backend. Invoked via the Workflow tool, it runs the parallel attempts, the blind Opus review, and (two pass) round 2 + final rank deterministically, watchable live in /workflows. It returns the structured mapping + rankings the skill reports in Phase 6. (Fallback when workflows are unavailable: manual Task-tool + glm CLI dispatch.)
- Agents are thin command-runners. Anthropic attempts run native via the Task tool. Each non-Anthropic attempt runs through its wrapper agent (a cheap Bash-only driver) executing the matching bin/*-run.sh: the script sets provider env, closes stdin, calls the nested claude/codex, and writes the provenance marker. This indirection exists because a wrapper handed a raw nested command proved unreliable (it would solve the task itself, refuse on "safety," or let the weak inner model bail without saving); it matters most for codex, a fully autonomous external agent.
- The judge is fixed. Reviewer and final ranker are always Opus via the Task tool. Before judging, the engine stages each deliverable into a clean blind tree (copying files and deleting the known engine files, _brief.txt and the run logs, by exact name, an allowlist that keeps legitimately _-prefixed deliverables), validates the success provenance contract, and pools the valid deliverables into one blind-labelled _pool.md the judge reads, failing closed on any invalid candidate.
- Grand loops add farnsworth-implementer (the only repo-writer) and bin/fl-git.sh (all git/gh); the tournament engine itself stays repo-pure.
The source is candid about these, and you should be too:
- Blindness has honour-system edges. The judge gets Read/Bash and the absolute run dir, so it could walk to a sibling workspace and read a provenance log; the prompt tells it not to, but that part isn't mechanically enforced. And while the blind letter is decorrelated from dispatch order, the presentation order in the pool is fixed, so any positional bias in the judge is uncorrected: weight on merits, not order.
- Cost grows with N and mode. Single pass ≈ N attempts + 1 Opus pass; two pass ≈ 2N attempts + 2 Opus passes; grand loops multiply that by Z and add an Opus implementer + verify per loop. Cost is explicitly not the design constraint here, but it is real: the skill confirms volume at higher N and Z_MAX caps the chain at 5.
- Weaker models fail honestly. glm-4.5-air and small local MLX models sometimes don't save a deliverable; codex can refuse or run to its wall-clock timeout. These are excluded as failed attempts, not papered over, and the round proceeds over the survivors. On-device models are poorly suited to heavy writing deliverables (one local model timed out at 600s on a ~27KB proposal), so prefer hosted providers there.
- Claiming a dogfood item is best-effort, not a lock (no GitHub compare-and-swap); use the git-ref escape hatch when you need strict exclusivity.
- Grand loops never auto-merge, never resume. A failed verify halts the chain into a draft needs-human PR; a mid-loop death stops and asks a human to clean up the orphan branch. By design, you are always the one who merges.
Ralph persists. Karpathy measures. Farnsworth learns.
