fix(bench): default all models to deepseek-v4-flash (gpt-5 hangs; OpenAI+Claude credits out) #297

Merged: drewstone merged 2 commits into main from fix/gpt5-hanging-defaults on Jun 14, 2026

Conversation

@drewstone drewstone commented Jun 14, 2026

Two pre-existing problems with bench default models, fixed together: (1) gpt-5 defaults hang (30s+ timeout) — a bench without an explicit model stalled on task 1; (2) OpenAI + Claude credits are exhausted, so every gpt-*/claude-* default fails via the router. Flipped every runtime + profile default (gpt-5/gpt-4.1/gpt-4o/gpt-4o-mini/gpt-3.5-turbo/claude-sonnet-4-6) → deepseek-v4-flash (working, cheap, on the Tangle router). Env override unchanged; gemini judges + test fixtures untouched. Capability-preserving.

…pt-5 hangs)

gpt-5 via the router hangs (no response, 30s+ timeout) — a bench launched without an
explicit WORKER_MODEL/JUDGE_MODEL silently stalled on the first task. Flip every runtime
default + default profile model from gpt-5 to gpt-4.1 (a working model). Env override is
unchanged; this only fixes the broken default. Test-fixture model labels (corpus.test.mts)
left as-is (not live calls).
tangletools previously approved these changes Jun 14, 2026

@tangletools tangletools left a comment

✅ Auto-approved PR — 5e45c88d

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-14T14:51:25Z

…ude credits out)

OpenAI and Claude credits are exhausted, so every gpt-*/claude-* default fails via the
router. Flip all runtime + profile default models (gpt-4.1, gpt-4o, gpt-4o-mini,
gpt-3.5-turbo, gpt-5, claude-sonnet-4-6) to deepseek-v4-flash — a working, cheap model
on the Tangle router (returns content, hybrid reasoning). Env override unchanged; gemini
judges left as-is (separate credits). Supersedes the gpt-4.1 default from the prior commit.

@tangletools tangletools left a comment

✅ Auto-approved PR — 013fbecc

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-14T14:54:56Z

@drewstone drewstone changed the title from "fix(bench): default models to gpt-4.1 not gpt-5 (gpt-5 hangs)" to "fix(bench): default all models to deepseek-v4-flash (gpt-5 hangs; OpenAI+Claude credits out)" on Jun 14, 2026

@tangletools tangletools left a comment

🟠 Value Audit — better-approach-exists

Verdict better-approach-exists
Concerns 5 (1 strong-concern, 2 medium-concern, 1 low, 1 weak-concern)
Heuristic 0.0s
Duplication 0.0s
Interrogation 260.9s (2 bridge agents)
Total 260.9s

💰 Value — better-approach-exists

Switches bench defaults to deepseek-v4-flash to dodge gpt-5 hangs and exhausted OpenAI/Claude credits, but leaves provider pairings and a few other defaults inconsistent and repeats the new literal across two dozen files instead of centralizing it.

  • What it does: Replaces the hardcoded default model in 24 bench scripts and in bench/src/profiles.ts from gpt-5/gpt-4.1/gpt-4o/gpt-4o-mini/gpt-3.5-turbo/claude-sonnet-4-6 to deepseek-v4-flash, while preserving env overrides (WORKER_MODEL, JUDGE_MODEL, REFLECT_MODEL, MODEL, EVAL_GATE_MODEL).
  • Goals it achieves: Make an out-of-the-box bench run complete instead of hanging on gpt-5 or failing because OpenAI/Claude credits are exhausted, by defaulting every runtime call to a working, cheap model that is available on the Tangle router.
  • Assessment: The direction is right and the change is low-risk because every default remains overridable by environment variables. It is not fully coherent, though: it scatters the same deepseek-v4-flash literal everywhere, it leaves some runtime defaults on gpt-5/claude-sonnet-4-6/gpt-4.1, and it keeps the in-box WORKER_PROVIDER default as openai even where the new default model is a cheap router model that t
  • Better / existing approach: Introduce a shared constants module (e.g., bench/src/defaults.ts exporting DEFAULT_WORKER_MODEL, DEFAULT_JUDGE_MODEL, and DEFAULT_INBOX_PROVIDER for cheap router models) and import it everywhere. Pair deepseek-v4-flash defaults with provider 'openai-compat' in in-box paths, matching commit0-gate.mts:369 and clbench-codebase-gate.mts:190 and the warning at experiment.ts:212-214. I searched bench/sr

🎯 Usefulness — sound-with-nits

Switches bench defaults to the working, cheap router model in the grain of the harness policy, but leaves four reachable entry points (including the documented improve-prompt script) with old gpt/claude defaults that still hit the hang/credit failures the PR is fixing.

  • Integration: The 24 changed files are all reachable bench entry points or shared defaults: bench/src/profiles.ts is imported by bench/src/search-bench/bridge.ts:16-17 and bench/src/search-bench/run.mts:22, so the profile defaults propagate. The new deepseek-v4-flash default matches the canonical harness model policy (bench/HARNESS.md:92-95), and env overrides are preserved so the behavior remains opt-out-abl
  • Fit with existing patterns: Fits the established pattern rather than competing with it. bench/HARNESS.md already prescribes deepseek-v4-pro/deepseek-v4-flash as the canonical cheap-router defaults and says 'never CC models'. This change consolidates the codebase onto that documented policy instead of introducing a new one.
  • Real-world viability: The new defaults are described as cheap and routable on the Tangle router, so they should survive normal concurrency and avoid credit exhaustion. Error paths are unchanged and env overrides still work. The realistic-use gap is that several reachable commands were not migrated and still default to gpt/claude, so running them without an explicit model env hits the original hang/credit failures.

🔎 Heuristic Signals

🟡 Cruft: console debug added bench/src/run.ts

  • console.log(`[solve-web-live] ${task.id}: "${goal}" @ ${startUrl} with ${process.env.WORKER_MODEL ?? 'deepseek-v4-flash'}…`)

💰 Value Audit

🔴 In-box provider default is wrong for the new deepseek-v4-flash default [against-grain]

experiment.ts:212-214 explicitly warns that cheap router models like deepseek 'are not in opencode's openai registry and 404 in-box — pass openai-compat'. run.ts:206 and run.ts:299 still default provider to process.env.WORKER_PROVIDER ?? 'openai' while now defaulting model to deepseek-v4-flash, and worker.ts:104 falls back to 'openai' when provider is omitted. This likely breaks the default solve-one/solve-cad path. commit0-gate.mts:369 and clbench-codebase-gate.mts:190 already default to 'opena
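
One way to avoid the model/provider mismatch described above is to derive the provider default from the resolved model instead of hardcoding 'openai'. The following is a minimal sketch, not the repository's code: `ROUTER_MODEL_PREFIXES`, `defaultProviderFor`, and `resolveWorker` are hypothetical names, and the prefix check assumes router model names start with the families named in the experiment.ts warning (deepseek/kimi/glm). The environment is passed in as a plain object so the sketch stays self-contained; a real script would pass `process.env`.

```typescript
type Env = Record<string, string | undefined>;

// Assumption: cheap router models can be recognised by these name prefixes.
const ROUTER_MODEL_PREFIXES = ['deepseek', 'kimi', 'glm'];

// Pick the provider that matches the model family.
function defaultProviderFor(model: string): 'openai' | 'openai-compat' {
  const isRouterModel = ROUTER_MODEL_PREFIXES.some((prefix) => model.startsWith(prefix));
  return isRouterModel ? 'openai-compat' : 'openai';
}

// Explicit env values still win; only the fallbacks are paired.
function resolveWorker(env: Env): { model: string; provider: string } {
  const model = env['WORKER_MODEL'] ?? 'deepseek-v4-flash';
  const provider = env['WORKER_PROVIDER'] ?? defaultProviderFor(model);
  return { model, provider };
}

const outOfTheBox = resolveWorker({}); // provider: 'openai-compat'
const explicitGpt = resolveWorker({ WORKER_MODEL: 'gpt-4.1' }); // provider: 'openai'
```

With this shape an out-of-the-box run pairs deepseek-v4-flash with 'openai-compat', while an explicit WORKER_PROVIDER override behaves exactly as before.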

🟠 The same model literal is repeated in ~24 files instead of a shared constant [better-architecture]

grep shows process.env.WORKER_MODEL ?? 'deepseek-v4-flash' duplicated across aec-gate.mts, run.ts, keystone-gate-cli.mts, rsi.ts, research-loop.mts, skills-sandbox.mts, etc. The prior commit 5e45c88 had to perform the same multi-file literal swap from gpt-5 to gpt-4.1. A single exported DEFAULT_WORKER_MODEL (and DEFAULT_JUDGE_MODEL) would make the next router/credit swap a one-line change.
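
The shared-constant idea can be sketched as follows. This is an illustration under assumptions, not code from the PR: the constant names follow the audit's suggestion, the `resolveModel` helper is hypothetical, and in a real module (for example the suggested bench/src/defaults.ts) the declarations would be exported and `process.env` passed as the environment.

```typescript
type Env = Record<string, string | undefined>;

// Single source of truth: the next router or credit swap edits only these lines.
const DEFAULT_WORKER_MODEL = 'deepseek-v4-flash';
const DEFAULT_JUDGE_MODEL = 'deepseek-v4-flash';

// Hypothetical helper: an explicit env override wins, otherwise the shared default applies.
function resolveModel(env: Env, key: string, fallback: string): string {
  return env[key] ?? fallback;
}

// An override is respected, matching the "env override unchanged" behaviour.
const workerModel = resolveModel({ WORKER_MODEL: 'gpt-4.1' }, 'WORKER_MODEL', DEFAULT_WORKER_MODEL); // 'gpt-4.1'
// With nothing set, the shared default is used.
const judgeModel = resolveModel({}, 'JUDGE_MODEL', DEFAULT_JUDGE_MODEL); // 'deepseek-v4-flash'
```

Each script would then replace its inline `?? 'deepseek-v4-flash'` literal with a reference to the shared constant.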

🟡 A few runtime defaults still point to gpt-5 / claude-sonnet-4-6 / gpt-4.1 [proportion]

improve-prompt.ts:163 still defaults to claude-sonnet-4-6 or gpt-5 depending on flags, and commit0-gate.mts:352 still defaults to gpt-4.1 for the sandbox backend. The PR claims it flipped 'every runtime + profile default'; these either need to be aligned or explicitly documented as intentional exceptions.

🎯 Usefulness Audit

🟠 Four reachable bench entry points still default to gpt/claude [integration]

The PR claims it flipped 'every runtime + profile default', but bench/src/commit0-gate.mts:352 (sandbox backend defaults to 'gpt-4.1'), bench/src/fleet.mts:77/82 (MODEL/OBSERVER_MODEL default to 'gpt-4.1'), bench/src/cloud-loop.mts:71 (MODEL defaults to 'gpt-4.1'), and bench/src/improve-prompt.ts:162-163 (SANDBOX path defaults to 'gpt-5', scoreBased domains default to 'claude-sonnet-4-6') still use the old models. improve-prompt is a documented pnpm script (bench/package.json:15; bench/src/run


What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260614T150515Z

@tangletools

✅ No Blockers — 013fbecc

Readiness 76/100 · Confidence 65/100 · 6 findings (2 medium, 4 low)

| | deepseek | glm | aggregate |
|---|---|---|---|
| Readiness | 82 | 76 | 76 |
| Confidence | 65 | 65 | 65 |
| Correctness | 82 | 76 | 76 |
| Security | 82 | 76 | 76 |
| Testing | 82 | 76 | 76 |
| Architecture | 82 | 76 | 76 |

Full multi-shot audit completed 1/1 planned shots over 24 changed files. Global verifier still owns final merge decision.

🟠 MEDIUM Incomplete coverage: improve-prompt.ts line 163 still defaults to claude-sonnet-4-6 and gpt-5 — bench/src/improve-prompt.ts

Line 163: process.env.WORKER_MODEL ?? (scoreBased ? 'claude-sonnet-4-6' : useSandbox ? 'gpt-5' : 'deepseek/deepseek-v4-pro'). The commit message says 'default ALL bench models to deepseek-v4-flash' but this conditional default in the SAME file was not updated. For score-based benchmarks (CAD, CADBench, Mind2Web, AppWorld) it still defaults to claude-sonnet-4-6; for sandbox mode it defaults to gpt-5. If the PR's intent is cost reduction (credits out), these paths will fail when hit without env override. Fix: change to process.env.WORKER_MODEL ?? 'deepseek-v4-flash' or at minimum update the conditional branches.
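
The difference between the quoted conditional and the first suggested fix can be illustrated with a small before/after sketch. The function names and the `Flags` type are hypothetical; only the default expressions are taken from the finding, and the environment is passed in as a plain object rather than read from `process.env`.

```typescript
type Env = Record<string, string | undefined>;
interface Flags { scoreBased: boolean; useSandbox: boolean }

// Before: the conditional default quoted in the finding.
function workerModelBefore(env: Env, flags: Flags): string {
  return env['WORKER_MODEL']
    ?? (flags.scoreBased ? 'claude-sonnet-4-6' : flags.useSandbox ? 'gpt-5' : 'deepseek/deepseek-v4-pro');
}

// After: the first suggested fix, a single default for every branch.
function workerModelAfter(env: Env): string {
  return env['WORKER_MODEL'] ?? 'deepseek-v4-flash';
}

const scoreBasedRun: Flags = { scoreBased: true, useSandbox: false };
const before = workerModelBefore({}, scoreBasedRun); // 'claude-sonnet-4-6'
const after = workerModelAfter({}); // 'deepseek-v4-flash'
```

Collapsing the branches also changes the non-sandbox, non-score-based default away from 'deepseek/deepseek-v4-pro', so updating only the two failing branches may be the more conservative option.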

🟠 MEDIUM Model default changed to deepseek-v4-flash but provider default still 'openai' — documented to 404 in-box — bench/src/run.ts, bench/src/browser/adapters/bad.ts

Evidence: (1) run.ts:206: provider: process.env.WORKER_PROVIDER ?? 'openai' unchanged, but model default on [line 205](https://github.com/tangle-network/agent-runtime/blob/013fbecc0f034579b370347e36f3a4d5f6eb6b24/bench/src/run.ts#L205) changed from 'gpt-5' to 'deepseek-v4-flash'. Same pattern at run.ts:299. (2) bad.ts:123: cfg.provider ?? 'openai' unchanged, but model default on [line 124](https://github.com/tangle-network/agent-runtime/blob/013fbecc0f034579b370347e36f3a4d5f6eb6b24/bench/src/browser/adapters/bad.ts#L124) changed from 'gpt-4o' to 'deepseek-v4-flash'. (3) experiment.ts:212-214 explicitly states: "Cheap router models (deepseek/kimi/glm) are not in opencode's 'openai' registry and 404 in-box — pass 'openai-compat

🟡 LOW Stale JSDoc comments claiming wrong default model names — bench/src/generate-eval/certify.ts

Line 21: comment says EVAL_GATE_MODEL (default gpt-4.1) but code at line 122 now defaults to deepseek-v4-flash. Similarly, bench/src/benchmarks/cadbench.ts:11 says JUDGE_MODEL (default gpt-4o) (code now deepseek-v4-flash), and bench/src/run-benchmarks.ts:32 says defaults to the profile's default model, then gpt-5 (code now deepseek-v4-flash). These are actively misleading to anyone reading the docs without checking the code. Fix: update the comment defaults to match.

🟡 LOW Stale comment in certify.ts: model default documented as gpt-4.1 but code is deepseek-v4-flash — bench/src/generate-eval/certify.ts

Line 22: * EVAL_GATE_MODEL (default gpt-4.1) — but lines 122 and 145 now default to 'deepseek-v4-flash' via the ?? 'deepseek-v4-flash' fallback. The comment is stale and misleading. Fix: update the comment to match the code default.

🟡 LOW Stale usage-example comments reference old model names across 8 files — bench/src/humaneval-repair-gate.mts

humaneval-repair-gate.mts:19 shows WORKER_MODEL=gpt-3.5-turbo, math-demo.mts:10 and strategy-demo.mts:10 show WORKER_MODEL=gpt-4o-mini, profile-coord-sandbox.mts:14 shows WORKER_MODEL=gpt-4.1, research-gate.mts:24-25 shows MODEL=gpt-4o-mini and JUDGE_MODEL=gpt-4o-mini, research-loop.mts:11-12 same, parametric-check.mts:10 shows MODEL=gpt-4.1, skills-sandbox.mts:12 shows WORKER_MODEL=gpt-4.1. These are usage examples (not claiming defaults) so are lower impact, but are inconsistent with the PR's standardization goal. Fix: update example commands to match new defaults.
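
Stale references like these could be caught mechanically. Below is a minimal sketch of such a check, assuming the retired names are exactly the ones this PR replaced; `RETIRED_DEFAULTS` and `findStaleModelRefs` are hypothetical names, and the sample text is made up for illustration rather than copied from the repository.

```typescript
// Longer names come first so 'gpt-4o-mini' is reported instead of its prefix 'gpt-4o'.
const RETIRED_DEFAULTS = ['gpt-4o-mini', 'gpt-4o', 'gpt-4.1', 'gpt-3.5-turbo', 'gpt-5', 'claude-sonnet-4-6'];

interface StaleRef { line: number; model: string; text: string }

// Report the first retired model name found on each line (1-indexed line numbers).
function findStaleModelRefs(source: string): StaleRef[] {
  const hits: StaleRef[] = [];
  source.split('\n').forEach((text, index) => {
    const model = RETIRED_DEFAULTS.find((name) => text.includes(name));
    if (model !== undefined) {
      hits.push({ line: index + 1, model, text: text.trim() });
    }
  });
  return hits;
}

// Illustrative input: one stale usage comment, one current default.
const sample = [
  ' * usage: WORKER_MODEL=gpt-4o-mini <run command>',
  "const model = env.WORKER_MODEL ?? 'deepseek-v4-flash';",
].join('\n');

const staleRefs = findStaleModelRefs(sample); // one hit: line 1, model 'gpt-4o-mini'
```

A plain substring match will also flag intentional mentions such as test-fixture labels, so the output is better treated as a review list than as a hard failure.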

🟡 LOW Self-preference bias risk: same model as default for both worker and judge — bench/src/profiles.ts

profiles.ts now defaults worker, analyst, and driver profiles all to 'deepseek-v4-flash'. Meanwhile cadbench.ts, finsearchcomp.ts, and simpleqa.ts also default their JUDGE_MODEL to 'deepseek-v4-flash'. When neither JUDGE_MODEL nor WORKER_MODEL is set, the judge and the judged are the same model, introducing self-preference bias. Previously, defaults differentiated roles (e.g., worker=gpt-5, judge=gpt-4o). This is an acceptable cost-driven tradeoff since all are env-overridable, but benchmark results produced with these defaults should be flagged as using same-model judging.
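
The flagging suggested above can be sketched as a small resolver that records whether worker and judge resolved to the same model. This is an assumption-laden illustration: `resolveJudging` and the `sameModelJudging` field are hypothetical, and a single shared default is used for both roles even though some judges in the repository keep other defaults (the PR leaves the gemini judges untouched).

```typescript
type Env = Record<string, string | undefined>;

const DEFAULT_MODEL = 'deepseek-v4-flash';

interface JudgingConfig {
  workerModel: string;
  judgeModel: string;
  sameModelJudging: boolean; // true when the judge grades its own model's output
}

// Resolve both roles and record whether they collapsed onto one model.
function resolveJudging(env: Env): JudgingConfig {
  const workerModel = env['WORKER_MODEL'] ?? DEFAULT_MODEL;
  const judgeModel = env['JUDGE_MODEL'] ?? DEFAULT_MODEL;
  return { workerModel, judgeModel, sameModelJudging: workerModel === judgeModel };
}

const defaults = resolveJudging({}); // sameModelJudging: true
const separated = resolveJudging({ JUDGE_MODEL: 'gemini-judge' }); // placeholder judge name; sameModelJudging: false
```

Writing the flag into each result record would let downstream reports separate same-model runs from runs with an independent judge.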


tangletools · 2026-06-14T15:08:07Z · trace

@drewstone drewstone merged commit 8f8ec99 into main Jun 14, 2026
1 check passed

2 participants