fix(bench): default all models to deepseek-v4-flash (gpt-5 hangs; OpenAI+Claude credits out) #297
…pt-5 hangs) gpt-5 via the router hangs (no response, 30s+ timeout) — a bench launched without an explicit WORKER_MODEL/JUDGE_MODEL silently stalled on the first task. Flip every runtime default + default profile model from gpt-5 to gpt-4.1 (a working model). Env override is unchanged; this only fixes the broken default. Test-fixture model labels (corpus.test.mts) left as-is (not live calls).
tangletools
left a comment
✅ Auto-approved PR — 5e45c88d
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-14T14:51:25Z
…ude credits out) OpenAI and Claude credits are exhausted, so every gpt-*/claude-* default fails via the router. Flip all runtime + profile default models (gpt-4.1, gpt-4o, gpt-4o-mini, gpt-3.5-turbo, gpt-5, claude-sonnet-4-6) to deepseek-v4-flash — a working, cheap model on the Tangle router (returns content, hybrid reasoning). Env override unchanged; gemini judges left as-is (separate credits). Supersedes the gpt-4.1 default from the prior commit.
tangletools
left a comment
✅ Auto-approved PR — 013fbecc
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-14T14:54:56Z
tangletools
left a comment
🟠 Value Audit — better-approach-exists
| | |
|---|---|
| Verdict | better-approach-exists |
| Concerns | 5 (1 strong-concern, 2 medium-concern, 1 low, 1 weak-concern) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 260.9s (2 bridge agents) |
| Total | 260.9s |
💰 Value — better-approach-exists
Switches bench defaults to deepseek-v4-flash to dodge gpt-5 hangs and exhausted OpenAI/Claude credits, but leaves provider pairings and a few other defaults inconsistent and repeats the new literal across two dozen files instead of centralizing it.
- What it does: Replaces the hardcoded default model in 24 bench scripts and in bench/src/profiles.ts from gpt-5/gpt-4.1/gpt-4o/gpt-4o-mini/gpt-3.5-turbo/claude-sonnet-4-6 to deepseek-v4-flash, while preserving env overrides (WORKER_MODEL, JUDGE_MODEL, REFLECT_MODEL, MODEL, EVAL_GATE_MODEL).
- Goals it achieves: Make an out-of-the-box bench run complete instead of hanging on gpt-5 or failing because OpenAI/Claude credits are exhausted, by defaulting every runtime call to a working, cheap model that is available on the Tangle router.
- Assessment: The direction is right and the change is low-risk because every default remains overridable by environment variables. It is not fully coherent, though: it scatters the same deepseek-v4-flash literal everywhere, it leaves some runtime defaults on gpt-5/claude-sonnet-4-6/gpt-4.1, and it keeps the in-box WORKER_PROVIDER default as openai even where the new default model is a cheap router model that t
- Better / existing approach: Introduce a shared constants module (e.g., bench/src/defaults.ts exporting DEFAULT_WORKER_MODEL, DEFAULT_JUDGE_MODEL, and DEFAULT_INBOX_PROVIDER for cheap router models) and import it everywhere. Pair deepseek-v4-flash defaults with provider 'openai-compat' in in-box paths, matching commit0-gate.mts:369 and clbench-codebase-gate.mts:190 and the warning at experiment.ts:212-214. I searched bench/sr
🎯 Usefulness — sound-with-nits
Switches bench defaults to the working, cheap router model in the grain of the harness policy, but leaves four reachable entry points (including the documented improve-prompt script) with old gpt/claude defaults that still hit the hang/credit failures the PR is fixing.
- Integration: The 24 changed files are all reachable bench entry points or shared defaults: bench/src/profiles.ts is imported by bench/src/search-bench/bridge.ts:16-17 and bench/src/search-bench/run.mts:22, so the profile defaults propagate. The new `deepseek-v4-flash` default matches the canonical harness model policy (bench/HARNESS.md:92-95), and env overrides are preserved so the behavior remains opt-out-abl
- Fit with existing patterns: Fits the established pattern rather than competing with it. bench/HARNESS.md already prescribes `deepseek-v4-pro`/`deepseek-v4-flash` as the canonical cheap-router defaults and says 'never CC models'. This change consolidates the codebase onto that documented policy instead of introducing a new one.
- Real-world viability: The new defaults are described as cheap and routable on the Tangle router, so they should survive normal concurrency and avoid credit exhaustion. Error paths are unchanged and env overrides still work. The realistic-use gap is that several reachable commands were not migrated and still default to gpt/claude, so running them without an explicit model env hits the original hang/credit failures.
🔎 Heuristic Signals
🟡 Cruft: console debug added bench/src/run.ts
- console.log(`[solve-web-live] ${task.id}: "${goal}" @ ${startUrl} with ${process.env.WORKER_MODEL ?? 'deepseek-v4-flash'}…`)
💰 Value Audit
🔴 In-box provider default is wrong for the new deepseek-v4-flash default [against-grain]
experiment.ts:212-214 explicitly warns that cheap router models like deepseek 'are not in opencode's openai registry and 404 in-box — pass openai-compat'. run.ts:206 and run.ts:299 still default provider to process.env.WORKER_PROVIDER ?? 'openai' while now defaulting model to deepseek-v4-flash, and worker.ts:104 falls back to 'openai' when provider is omitted. This likely breaks the default solve-one/solve-cad path. commit0-gate.mts:369 and clbench-codebase-gate.mts:190 already default to 'opena
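One way to address the pairing concern above is to derive the provider default from the model default instead of hardcoding `'openai'`. The following is a minimal sketch, not the repo's code: `resolveWorkerConfig`, `defaultProviderFor`, and the prefix list are hypothetical names, and the assumption that a simple prefix check identifies cheap router models is mine. The env is passed in as a plain object so the sketch runs without Node typings.

```typescript
type Env = Record<string, string | undefined>;

interface WorkerConfig {
  model: string;
  provider: string;
}

// Assumption: cheap router models can be recognized by these name prefixes.
const ROUTER_MODEL_PREFIXES = ['deepseek', 'kimi', 'glm'];

/** Pick the provider that matches the model when WORKER_PROVIDER is not set. */
function defaultProviderFor(model: string): string {
  const isRouterModel = ROUTER_MODEL_PREFIXES.some((prefix) => model.startsWith(prefix));
  return isRouterModel ? 'openai-compat' : 'openai';
}

/** Explicit env values always win; defaults are chosen as a consistent pair. */
function resolveWorkerConfig(env: Env): WorkerConfig {
  const model = env['WORKER_MODEL'] ?? 'deepseek-v4-flash';
  const provider = env['WORKER_PROVIDER'] ?? defaultProviderFor(model);
  return { model, provider };
}
```

With this shape, a call such as `resolveWorkerConfig(process.env)` would keep an explicit `WORKER_PROVIDER` untouched and only change what happens when neither variable is set.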
🟠 The same model literal is repeated in ~24 files instead of a shared constant [better-architecture]
grep shows process.env.WORKER_MODEL ?? 'deepseek-v4-flash' duplicated across aec-gate.mts, run.ts, keystone-gate-cli.mts, rsi.ts, research-loop.mts, skills-sandbox.mts, etc. The prior commit 5e45c88 had to perform the same multi-file literal swap from gpt-5 to gpt-4.1. A single exported DEFAULT_WORKER_MODEL (and DEFAULT_JUDGE_MODEL) would make the next router/credit swap a one-line change.
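The shared-constant suggestion could look roughly like the sketch below. The constant names come from the audit text; the file location, the helper functions `workerModel` and `judgeModel`, and passing the env as an argument are assumptions made for illustration only.

```typescript
// Hypothetical bench/src/defaults.ts. Not existing code in the repo.
export const DEFAULT_WORKER_MODEL = 'deepseek-v4-flash';
export const DEFAULT_JUDGE_MODEL = 'deepseek-v4-flash';

type Env = Record<string, string | undefined>;

/** Worker model: explicit WORKER_MODEL override first, shared default otherwise. */
export function workerModel(env: Env): string {
  return env['WORKER_MODEL'] ?? DEFAULT_WORKER_MODEL;
}

/** Judge model: explicit JUDGE_MODEL override first, shared default otherwise. */
export function judgeModel(env: Env): string {
  return env['JUDGE_MODEL'] ?? DEFAULT_JUDGE_MODEL;
}
```

Call sites would then replace `process.env.WORKER_MODEL ?? 'deepseek-v4-flash'` with `workerModel(process.env)`, so the next model swap touches one file instead of two dozen.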
🟡 A few runtime defaults still point to gpt-5 / claude-sonnet-4-6 / gpt-4.1 [proportion]
improve-prompt.ts:163 still defaults to claude-sonnet-4-6 or gpt-5 depending on flags, and commit0-gate.mts:352 still defaults to gpt-4.1 for the sandbox backend. The PR claims it flipped 'every runtime + profile default'; these either need to be aligned or explicitly documented as intentional exceptions.
🎯 Usefulness Audit
🟠 Four reachable bench entry points still default to gpt/claude [integration]
The PR claims it flipped 'every runtime + profile default', but bench/src/commit0-gate.mts:352 (sandbox backend defaults to 'gpt-4.1'), bench/src/fleet.mts:77/82 (MODEL/OBSERVER_MODEL default to 'gpt-4.1'), bench/src/cloud-loop.mts:71 (MODEL defaults to 'gpt-4.1'), and bench/src/improve-prompt.ts:162-163 (SANDBOX path defaults to 'gpt-5', scoreBased domains default to 'claude-sonnet-4-6') still use the old models. `improve-prompt` is a documented pnpm script (bench/package.json:15; bench/src/run
What this audit checks
It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.
| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |
Findings are concerns, not blocks — the human reviewer decides what to do with them.
✅ No Blockers
| | deepseek | glm | aggregate |
|---|---|---|---|
| Readiness | 82 | 76 | 76 |
| Confidence | 65 | 65 | 65 |
| Correctness | 82 | 76 | 76 |
| Security | 82 | 76 | 76 |
| Testing | 82 | 76 | 76 |
| Architecture | 82 | 76 | 76 |
Full multi-shot audit completed 1/1 planned shots over 24 changed files. Global verifier still owns final merge decision.
🟠 MEDIUM Incomplete coverage: improve-prompt.ts line 163 still defaults to claude-sonnet-4-6 and gpt-5 — bench/src/improve-prompt.ts
Line 163: `process.env.WORKER_MODEL ?? (scoreBased ? 'claude-sonnet-4-6' : useSandbox ? 'gpt-5' : 'deepseek/deepseek-v4-pro')`. The commit message says 'default ALL bench models to deepseek-v4-flash' but this conditional default in the SAME file was not updated. For score-based benchmarks (CAD, CADBench, Mind2Web, AppWorld) it still defaults to claude-sonnet-4-6; for sandbox mode it defaults to gpt-5. If the PR's intent is cost reduction (credits out), these paths will fail when hit without env override. Fix: change to `process.env.WORKER_MODEL ?? 'deepseek-v4-flash'` or at minimum update the conditional branches.
🟠 MEDIUM Model default changed to deepseek-v4-flash but provider default still 'openai' — documented to 404 in-box — [bench/src/run.ts](https://github.com/tangle-network/agent-runtime/blob/013fbecc0f034579b370347e36f3a4d5f6eb6b24/bench/src/run.ts#L205), [bench/src/browser/adapters/bad.ts](https://github.com/tangle-network/agent-runtime/blob/013fbecc0f034579b370347e36f3a4d5f6eb6b24/bench/src/browser/adapters/bad.ts#L124)
Evidence: (1) run.ts:206: `provider: process.env.WORKER_PROVIDER ?? 'openai'` unchanged, but model default on [line 205](https://github.com/tangle-network/agent-runtime/blob/013fbecc0f034579b370347e36f3a4d5f6eb6b24/bench/src/run.ts#L205) changed from `'gpt-5'` to `'deepseek-v4-flash'`. Same pattern at run.ts:299. (2) bad.ts:123: `cfg.provider ?? 'openai'` unchanged, but model default on [line 124](https://github.com/tangle-network/agent-runtime/blob/013fbecc0f034579b370347e36f3a4d5f6eb6b24/bench/src/browser/adapters/bad.ts#L124) changed from `'gpt-4o'` to `'deepseek-v4-flash'`. (3) experiment.ts:212-214 explicitly states: "Cheap router models (deepseek/kimi/glm) are not in opencode's 'openai' registry and 404 in-box — pass 'openai-compat
🟡 LOW Stale JSDoc comments claiming wrong default model names — bench/src/generate-eval/certify.ts
Line 21: comment says `EVAL_GATE_MODEL (default gpt-4.1)` but code at line 122 now defaults to `deepseek-v4-flash`. Similarly, bench/src/benchmarks/cadbench.ts:11 says `JUDGE_MODEL (default gpt-4o)` (code now deepseek-v4-flash), and bench/src/run-benchmarks.ts:32 says `defaults to the profile's default model, then gpt-5` (code now deepseek-v4-flash). These are actively misleading to anyone reading the docs without checking the code. Fix: update the comment defaults to match.
🟡 LOW Stale comment in certify.ts: model default documented as gpt-4.1 but code is deepseek-v4-flash — bench/src/generate-eval/certify.ts
Line 22: `* EVAL_GATE_MODEL (default gpt-4.1)`, but lines 122 and 145 now default to `'deepseek-v4-flash'` via the `?? 'deepseek-v4-flash'` fallback. The comment is stale and misleading. Fix: update the comment to match the code default.
🟡 LOW Stale usage-example comments reference old model names across 8 files — bench/src/humaneval-repair-gate.mts
humaneval-repair-gate.mts:19 shows `WORKER_MODEL=gpt-3.5-turbo`, math-demo.mts:10 and strategy-demo.mts:10 show `WORKER_MODEL=gpt-4o-mini`, profile-coord-sandbox.mts:14 shows `WORKER_MODEL=gpt-4.1`, research-gate.mts:24-25 shows `MODEL=gpt-4o-mini` and `JUDGE_MODEL=gpt-4o-mini`, research-loop.mts:11-12 same, parametric-check.mts:10 shows `MODEL=gpt-4.1`, skills-sandbox.mts:12 shows `WORKER_MODEL=gpt-4.1`. These are usage examples (not claiming defaults) so are lower impact, but are inconsistent with the PR's standardization goal. Fix: update example commands to match new defaults.
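Stale mentions like these could be caught mechanically. The sketch below scans a source string for the model names this PR retired; `findStaleModelMentions` and `LEGACY_MODELS` are hypothetical names, and plain substring matching is an assumption that would also flag intentional references such as test fixtures.

```typescript
// Model names the PR moved away from. Longer names come first so that
// 'gpt-4o-mini' is reported as itself rather than as 'gpt-4o'.
const LEGACY_MODELS = [
  'gpt-3.5-turbo',
  'gpt-4o-mini',
  'gpt-4o',
  'gpt-4.1',
  'gpt-5',
  'claude-sonnet-4-6',
];

interface StaleHit {
  line: number; // 1-based line number within the scanned text
  model: string; // first legacy model name found on that line
  text: string; // trimmed line content, for reporting
}

/** Return one hit per line that still mentions a legacy model name. */
function findStaleModelMentions(source: string): StaleHit[] {
  const hits: StaleHit[] = [];
  source.split('\n').forEach((text, index) => {
    const model = LEGACY_MODELS.find((name) => text.includes(name));
    if (model !== undefined) {
      hits.push({ line: index + 1, model, text: text.trim() });
    }
  });
  return hits;
}
```

Running this over each bench file's contents would list the comments and usage examples that still need updating, leaving the reviewer to decide which hits are intentional.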
🟡 LOW Self-preference bias risk: same model as default for both worker and judge — bench/src/profiles.ts
profiles.ts now defaults worker, analyst, and driver profiles all to 'deepseek-v4-flash'. Meanwhile cadbench.ts, finsearchcomp.ts, and simpleqa.ts also default their JUDGE_MODEL to 'deepseek-v4-flash'. When neither JUDGE_MODEL nor WORKER_MODEL is set, the judge and the judged are the same model, introducing self-preference bias. Previously, defaults differentiated roles (e.g., worker=gpt-5, judge=gpt-4o). This is an acceptable cost-driven tradeoff since all are env-overridable, but benchmark results produced with these defaults should be flagged as using same-model judging.
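Flagging same-model judging could be as small as the sketch below. `describeJudging` and `JudgingSetup` are hypothetical names, and the assumption is that both roles resolve their model from `WORKER_MODEL` and `JUDGE_MODEL` with the new shared default.

```typescript
type Env = Record<string, string | undefined>;

const DEFAULT_MODEL = 'deepseek-v4-flash';

interface JudgingSetup {
  workerModel: string;
  judgeModel: string;
  sameModelJudging: boolean; // true when the judge grades its own model's output
}

/** Resolve both roles and record whether they collapse onto one model. */
function describeJudging(env: Env): JudgingSetup {
  const workerModel = env['WORKER_MODEL'] ?? DEFAULT_MODEL;
  const judgeModel = env['JUDGE_MODEL'] ?? DEFAULT_MODEL;
  return { workerModel, judgeModel, sameModelJudging: workerModel === judgeModel };
}
```

Writing `sameModelJudging` into the run's result metadata would let downstream readers filter or caveat scores produced under the all-default configuration.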
tangletools · 2026-06-14T15:08:07Z · trace
Two pre-existing problems with bench default models, fixed together: (1) `gpt-5` defaults hang (30s+ timeout), so a bench without an explicit model stalled on task 1; (2) OpenAI + Claude credits are exhausted, so every `gpt-*`/`claude-*` default fails via the router. Flipped every runtime + profile default (`gpt-5`/`gpt-4.1`/`gpt-4o`/`gpt-4o-mini`/`gpt-3.5-turbo`/`claude-sonnet-4-6`) → `deepseek-v4-flash` (working, cheap, on the Tangle router). Env override unchanged; gemini judges + test fixtures untouched. Capability-preserving.