fix(bench): default all models to deepseek-v4-flash (gpt-5 hangs; OpenAI+Claude credits out) by drewstone · Pull Request #297 · tangle-network/agent-runtime

drewstone · 2026-06-14T14:51:18Z

Two pre-existing problems with bench default models, fixed together: (1) gpt-5 defaults hang (30s+ timeout) — a bench without an explicit model stalled on task 1; (2) OpenAI + Claude credits are exhausted, so every gpt-*/claude-* default fails via the router. Flipped every runtime + profile default (gpt-5/gpt-4.1/gpt-4o/gpt-4o-mini/gpt-3.5-turbo/claude-sonnet-4-6) → deepseek-v4-flash (working, cheap, on the Tangle router). Env override unchanged; gemini judges + test fixtures untouched. Capability-preserving.

…pt-5 hangs) gpt-5 via the router hangs (no response, 30s+ timeout) — a bench launched without an explicit WORKER_MODEL/JUDGE_MODEL silently stalled on the first task. Flip every runtime default + default profile model from gpt-5 to gpt-4.1 (a working model). Env override is unchanged; this only fixes the broken default. Test-fixture model labels (corpus.test.mts) left as-is (not live calls).

tangletools

✅ Auto-approved PR — `5e45c88d`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-14T14:51:25Z}

…ude credits out) OpenAI and Claude credits are exhausted, so every gpt-*/claude-* default fails via the router. Flip all runtime + profile default models (gpt-4.1, gpt-4o, gpt-4o-mini, gpt-3.5-turbo, gpt-5, claude-sonnet-4-6) to deepseek-v4-flash — a working, cheap model on the Tangle router (returns content, hybrid reasoning). Env override unchanged; gemini judges left as-is (separate credits). Supersedes the gpt-4.1 default from the prior commit.

tangletools

✅ Auto-approved PR — `013fbecc`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-14T14:54:56Z}

tangletools

🟠 Value Audit — better-approach-exists


Verdict	better-approach-exists
Concerns	5 (1 strong-concern, 2 medium-concern, 1 low, 1 weak-concern)
Heuristic	0.0s
Duplication	0.0s
Interrogation	260.9s (2 bridge agents)
Total	260.9s

💰 Value — better-approach-exists

Switches bench defaults to deepseek-v4-flash to dodge gpt-5 hangs and exhausted OpenAI/Claude credits, but leaves provider pairings and a few other defaults inconsistent and repeats the new literal across two dozen files instead of centralizing it.

What it does: Replaces the hardcoded default model in 24 bench scripts and in bench/src/profiles.ts from gpt-5/gpt-4.1/gpt-4o/gpt-4o-mini/gpt-3.5-turbo/claude-sonnet-4-6 to deepseek-v4-flash, while preserving env overrides (WORKER_MODEL, JUDGE_MODEL, REFLECT_MODEL, MODEL, EVAL_GATE_MODEL).
Goals it achieves: Make an out-of-the-box bench run complete instead of hanging on gpt-5 or failing because OpenAI/Claude credits are exhausted, by defaulting every runtime call to a working, cheap model that is available on the Tangle router.
Assessment: The direction is right and the change is low-risk because every default remains overridable by environment variables. It is not fully coherent, though: it scatters the same deepseek-v4-flash literal everywhere, it leaves some runtime defaults on gpt-5/claude-sonnet-4-6/gpt-4.1, and it keeps the in-box WORKER_PROVIDER default as openai even where the new default model is a cheap router model that t
Better / existing approach: Introduce a shared constants module (e.g., bench/src/defaults.ts exporting DEFAULT_WORKER_MODEL, DEFAULT_JUDGE_MODEL, and DEFAULT_INBOX_PROVIDER for cheap router models) and import it everywhere. Pair deepseek-v4-flash defaults with provider 'openai-compat' in in-box paths, matching commit0-gate.mts:369 and clbench-codebase-gate.mts:190 and the warning at experiment.ts:212-214. I searched bench/sr

🎯 Usefulness — sound-with-nits

Switches bench defaults to the working, cheap router model in the grain of the harness policy, but leaves four reachable entry points (including the documented improve-prompt script) with old gpt/claude defaults that still hit the hang/credit failures the PR is fixing.

Integration: The 24 changed files are all reachable bench entry points or shared defaults: bench/src/profiles.ts is imported by bench/src/search-bench/bridge.ts:16-17 and bench/src/search-bench/run.mts:22, so the profile defaults propagate. The new deepseek-v4-flash default matches the canonical harness model policy (bench/HARNESS.md:92-95), and env overrides are preserved so the behavior remains opt-out-abl
Fit with existing patterns: Fits the established pattern rather than competing with it. bench/HARNESS.md already prescribes deepseek-v4-pro/deepseek-v4-flash as the canonical cheap-router defaults and says 'never CC models'. This change consolidates the codebase onto that documented policy instead of introducing a new one.
Real-world viability: The new defaults are described as cheap and routable on the Tangle router, so they should survive normal concurrency and avoid credit exhaustion. Error paths are unchanged and env overrides still work. The realistic-use gap is that several reachable commands were not migrated and still default to gpt/claude, so running them without an explicit model env hits the original hang/credit failures.

🔎 Heuristic Signals

🟡 Cruft: console debug added bench/src/run.ts

``console.log(`[solve-web-live] ${task.id}: "${goal}" @ ${startUrl} with ${process.env.WORKER_MODEL ?? 'deepseek-v4-flash'}…`)``

💰 Value Audit

🔴 In-box provider default is wrong for the new deepseek-v4-flash default [against-grain]

experiment.ts:212-214 explicitly warns that cheap router models like deepseek 'are not in opencode's openai registry and 404 in-box — pass openai-compat'. run.ts:206 and run.ts:299 still default provider to process.env.WORKER_PROVIDER ?? 'openai' while now defaulting model to deepseek-v4-flash, and worker.ts:104 falls back to 'openai' when provider is omitted. This likely breaks the default solve-one/solve-cad path. commit0-gate.mts:369 and clbench-codebase-gate.mts:190 already default to 'opena

🟠 The same model literal is repeated in ~24 files instead of a shared constant [better-architecture]

grep shows process.env.WORKER_MODEL ?? 'deepseek-v4-flash' duplicated across aec-gate.mts, run.ts, keystone-gate-cli.mts, rsi.ts, research-loop.mts, skills-sandbox.mts, etc. The prior commit 5e45c88 had to perform the same multi-file literal swap from gpt-5 to gpt-4.1. A single exported DEFAULT_WORKER_MODEL (and DEFAULT_JUDGE_MODEL) would make the next router/credit swap a one-line change.
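A minimal sketch of the shared-constant approach suggested here, assuming a hypothetical `bench/src/defaults.ts`. The constant names follow the audit's suggestion; the helper names `workerModel` and `judgeModel` are illustrative and do not exist in the repo. The environment is passed in as a plain record so the helpers stay easy to test.

```typescript
// Hypothetical bench/src/defaults.ts. Nothing here exists in the repo yet;
// the constant names mirror the audit's suggestion.
export const DEFAULT_WORKER_MODEL = 'deepseek-v4-flash';
export const DEFAULT_JUDGE_MODEL = 'deepseek-v4-flash';

type Env = Record<string, string | undefined>;

/** Worker model: an explicit WORKER_MODEL wins, otherwise the shared default. */
export function workerModel(env: Env): string {
  return env.WORKER_MODEL ?? DEFAULT_WORKER_MODEL;
}

/** Judge model: an explicit JUDGE_MODEL wins, otherwise the shared default. */
export function judgeModel(env: Env): string {
  return env.JUDGE_MODEL ?? DEFAULT_JUDGE_MODEL;
}
```

Call sites would pass `process.env`, for example `workerModel(process.env)`, so the next router or credit swap touches only the two constants.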

🟡 A few runtime defaults still point to gpt-5 / claude-sonnet-4-6 / gpt-4.1 [proportion]

improve-prompt.ts:163 still defaults to claude-sonnet-4-6 or gpt-5 depending on flags, and commit0-gate.mts:352 still defaults to gpt-4.1 for the sandbox backend. The PR claims it flipped 'every runtime + profile default'; these either need to be aligned or explicitly documented as intentional exceptions.
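One way to catch leftovers like these is a small scan for quoted `gpt-*` / `claude-*` literals. The sketch below is an assumption about how such a check could look, not an existing script: `findLegacyModelLiterals` is a hypothetical name, and it takes file contents as a plain map so it carries no filesystem dependency. It over-reports by design, since test-fixture labels would also match and would need an allowlist.

```typescript
interface LegacyModelHit {
  file: string;
  line: number; // 1-based line number within the file
  model: string;
}

// Matches a quoted literal that begins with "gpt-" or "claude-", e.g. 'gpt-4.1'.
const LEGACY_MODEL_LITERAL = /['"]((?:gpt|claude)-[\w.-]+)['"]/g;

/** Report every quoted gpt-/claude- literal found in the given sources. */
function findLegacyModelLiterals(sources: Record<string, string>): LegacyModelHit[] {
  const hits: LegacyModelHit[] = [];
  for (const file of Object.keys(sources)) {
    const lines = sources[file].split('\n');
    lines.forEach((text, index) => {
      LEGACY_MODEL_LITERAL.lastIndex = 0; // restart the global regex for each line
      let match: RegExpExecArray | null;
      while ((match = LEGACY_MODEL_LITERAL.exec(text)) !== null) {
        hits.push({ file, line: index + 1, model: match[1] });
      }
    });
  }
  return hits;
}
```

Unquoted usage examples in comments (such as `WORKER_MODEL=gpt-4.1`) are deliberately not matched, because only quoted literals can act as code defaults.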

🎯 Usefulness Audit

🟠 Four reachable bench entry points still default to gpt/claude [integration]

The PR claims it flipped 'every runtime + profile default', but bench/src/commit0-gate.mts:352 (sandbox backend defaults to 'gpt-4.1'), bench/src/fleet.mts:77/82 (MODEL/OBSERVER_MODEL default to 'gpt-4.1'), bench/src/cloud-loop.mts:71 (MODEL defaults to 'gpt-4.1'), and bench/src/improve-prompt.ts:162-163 (SANDBOX path defaults to 'gpt-5', scoreBased domains default to 'claude-sonnet-4-6') still use the old models. improve-prompt is a documented pnpm script (bench/package.json:15; bench/src/run

What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

| Pass | What it asks |
| --- | --- |
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |

Findings are concerns, not blocks — the human reviewer decides what to do with them.

_{value-audit · 20260614T150515Z}

tangletools · 2026-06-14T15:08:10Z

✅ No Blockers — `013fbecc`

Readiness 76/100 · Confidence 65/100 · 6 findings (2 medium, 4 low)

| | deepseek | glm | aggregate |
| --- | --- | --- | --- |
| Readiness | 82 | 76 | 76 |
| Confidence | 65 | 65 | 65 |
| Correctness | 82 | 76 | 76 |
| Security | 82 | 76 | 76 |
| Testing | 82 | 76 | 76 |
| Architecture | 82 | 76 | 76 |

Full multi-shot audit completed 1/1 planned shots over 24 changed files. Global verifier still owns final merge decision.

🟠 MEDIUM Incomplete coverage: improve-prompt.ts line 163 still defaults to claude-sonnet-4-6 and gpt-5 — bench/src/improve-prompt.ts

Line 163: process.env.WORKER_MODEL ?? (scoreBased ? 'claude-sonnet-4-6' : useSandbox ? 'gpt-5' : 'deepseek/deepseek-v4-pro'). The commit message says 'default ALL bench models to deepseek-v4-flash' but this conditional default in the SAME file was not updated. For score-based benchmarks (CAD, CADBench, Mind2Web, AppWorld) it still defaults to claude-sonnet-4-6; for sandbox mode it defaults to gpt-5. If the PR's intent is cost reduction (credits out), these paths will fail when hit without env override. Fix: change to process.env.WORKER_MODEL ?? 'deepseek-v4-flash' or at minimum update the conditional branches.
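For illustration, the default quoted above and the first suggested fix can be written as two small functions. This is a sketch, not the file's actual code: the function names and the `ImprovePromptFlags` shape are hypothetical, and the environment is passed in rather than read from `process.env`.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical flag shape; the real script derives these from its CLI arguments.
interface ImprovePromptFlags {
  scoreBased: boolean;
  useSandbox: boolean;
}

/** Default as quoted in the finding: two branches still point at retired models. */
function workerModelBefore(env: Env, flags: ImprovePromptFlags): string {
  return (
    env.WORKER_MODEL ??
    (flags.scoreBased ? 'claude-sonnet-4-6' : flags.useSandbox ? 'gpt-5' : 'deepseek/deepseek-v4-pro')
  );
}

/** First suggested fix: one router default for every path; the env override still wins. */
function workerModelAfter(env: Env): string {
  return env.WORKER_MODEL ?? 'deepseek-v4-flash';
}
```

The alternative fix mentioned above, updating only the two stale branches, would keep `deepseek/deepseek-v4-pro` for the remaining path.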

🟠 MEDIUM Model default changed to deepseek-v4-flash but provider default still 'openai' — documented to 404 in-box — [bench/src/run.ts](https://github.com/tangle-network/agent-runtime/blob/013fbecc0f034579b370347e36f3a4d5f6eb6b24/bench/src/run.ts#L205), [bench/src/browser/adapters/bad.ts](https://github.com/tangle-network/agent-runtime/blob/013fbecc0f034579b370347e36f3a4d5f6eb6b24/bench/src/browser/adapters/bad.ts#L124)

Evidence: (1) run.ts:206: provider: process.env.WORKER_PROVIDER ?? 'openai' unchanged, but model default on [line 205](https://github.com/tangle-network/agent-runtime/blob/013fbecc0f034579b370347e36f3a4d5f6eb6b24/bench/src/run.ts#L205) changed from 'gpt-5' to 'deepseek-v4-flash'. Same pattern at run.ts:299. (2) bad.ts:123: cfg.provider ?? 'openai' unchanged, but model default on [line 124](https://github.com/tangle-network/agent-runtime/blob/013fbecc0f034579b370347e36f3a4d5f6eb6b24/bench/src/browser/adapters/bad.ts#L124) changed from 'gpt-4o' to 'deepseek-v4-flash'. (3) experiment.ts:212-214 explicitly states: "Cheap router models (deepseek/kimi/glm) are not in opencode's 'openai' registry and 404 in-box — pass 'openai-compat
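One possible way to keep model and provider defaults paired is to derive the provider from the model whenever `WORKER_PROVIDER` is unset. The sketch below assumes the family list from the experiment.ts warning (deepseek/kimi/glm); `defaultProviderFor` and `resolveWorker` are hypothetical names, not functions in the repo.

```typescript
type Env = Record<string, string | undefined>;

// Families the experiment.ts warning names as absent from the in-box 'openai' registry.
const ROUTER_ONLY_FAMILIES = ['deepseek', 'kimi', 'glm'];

/** Pick a provider that matches the model family (assumed pairing rule). */
function defaultProviderFor(model: string): 'openai' | 'openai-compat' {
  // Drop an optional "vendor/" prefix such as "deepseek/deepseek-v4-pro".
  const name = model.slice(model.lastIndexOf('/') + 1).toLowerCase();
  const routerOnly = ROUTER_ONLY_FAMILIES.some((family) => name.indexOf(family) === 0);
  return routerOnly ? 'openai-compat' : 'openai';
}

/** Resolve model and provider together; explicit env values always win. */
function resolveWorker(env: Env): { model: string; provider: string } {
  const model = env.WORKER_MODEL ?? 'deepseek-v4-flash';
  const provider = env.WORKER_PROVIDER ?? defaultProviderFor(model);
  return { model, provider };
}
```

With this shape, an unset environment yields deepseek-v4-flash with 'openai-compat', while `WORKER_MODEL=gpt-4.1` alone still resolves to 'openai'.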

🟡 LOW Stale JSDoc comments claiming wrong default model names — bench/src/generate-eval/certify.ts

Line 21: comment says EVAL_GATE_MODEL (default gpt-4.1) but code at line 122 now defaults to deepseek-v4-flash. Similarly, bench/src/benchmarks/cadbench.ts:11 says JUDGE_MODEL (default gpt-4o) (code now deepseek-v4-flash), and bench/src/run-benchmarks.ts:32 says defaults to the profile's default model, then gpt-5 (code now deepseek-v4-flash). These are actively misleading to anyone reading the docs without checking the code. Fix: update the comment defaults to match.

🟡 LOW Stale comment in certify.ts: model default documented as gpt-4.1 but code is deepseek-v4-flash — bench/src/generate-eval/certify.ts

Line 22: * EVAL_GATE_MODEL (default gpt-4.1) — but lines 122 and 145 now default to 'deepseek-v4-flash' via the ?? 'deepseek-v4-flash' fallback. The comment is stale and misleading. Fix: update the comment to match the code default.

🟡 LOW Stale usage-example comments reference old model names across 8 files — bench/src/humaneval-repair-gate.mts

humaneval-repair-gate.mts:19 shows WORKER_MODEL=gpt-3.5-turbo, math-demo.mts:10 and strategy-demo.mts:10 show WORKER_MODEL=gpt-4o-mini, profile-coord-sandbox.mts:14 shows WORKER_MODEL=gpt-4.1, research-gate.mts:24-25 shows MODEL=gpt-4o-mini and JUDGE_MODEL=gpt-4o-mini, research-loop.mts:11-12 same, parametric-check.mts:10 shows MODEL=gpt-4.1, skills-sandbox.mts:12 shows WORKER_MODEL=gpt-4.1. These are usage examples (not claiming defaults) so are lower impact, but are inconsistent with the PR's standardization goal. Fix: update example commands to match new defaults.

🟡 LOW Self-preference bias risk: same model as default for both worker and judge — bench/src/profiles.ts

profiles.ts now defaults worker, analyst, and driver profiles all to 'deepseek-v4-flash'. Meanwhile cadbench.ts, finsearchcomp.ts, and simpleqa.ts also default their JUDGE_MODEL to 'deepseek-v4-flash'. When neither JUDGE_MODEL nor WORKER_MODEL is set, the judge and the judged are the same model, introducing self-preference bias. Previously, defaults differentiated roles (e.g., worker=gpt-5, judge=gpt-4o). This is an acceptable cost-driven tradeoff since all are env-overridable, but benchmark results produced with these defaults should be flagged as using same-model judging.
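A minimal sketch of how results could be flagged as same-model judging, assuming the resolved worker and judge models are recorded alongside each run. `describeJudging` and the `JudgingInfo` shape are hypothetical and not part of the bench code.

```typescript
type Env = Record<string, string | undefined>;

interface JudgingInfo {
  workerModel: string;
  judgeModel: string;
  sameModelJudging: boolean; // true when the judge is the same model it is judging
}

/** Resolve both models the way the bench defaults do and flag identical pairs. */
function describeJudging(env: Env, fallback: string = 'deepseek-v4-flash'): JudgingInfo {
  const workerModel = env.WORKER_MODEL ?? fallback;
  const judgeModel = env.JUDGE_MODEL ?? fallback;
  return { workerModel, judgeModel, sameModelJudging: workerModel === judgeModel };
}
```

Writing `sameModelJudging` into the run metadata would let later readers filter or caveat results produced with the shared default.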

_{tangletools · 2026-06-14T15:08:07Z · trace}

tangletools previously approved these changes Jun 14, 2026


drewstone dismissed tangletools’s stale review via 013fbec June 14, 2026 14:54

tangletools approved these changes Jun 14, 2026


drewstone changed the title ~~fix(bench): default models to gpt-4.1 not gpt-5 (gpt-5 hangs)~~ fix(bench): default all models to deepseek-v4-flash (gpt-5 hangs; OpenAI+Claude credits out) Jun 14, 2026

tangletools reviewed Jun 14, 2026


drewstone merged commit 8f8ec99 into main Jun 14, 2026
1 check passed

drewstone merged 2 commits into main from fix/gpt5-hanging-defaults
