fix(produced-state): faithful completion oracle — content-aware, ungameable, form-vocabulary-robust by drewstone · Pull Request #248 · tangle-network/agent-eval

drewstone · 2026-06-14T02:13:58Z

Why

The produced-state completion oracle (verifyCompletion) is the eval's anti-proxy guarantee: a score only means something if it confirms the agent actually did the regulated work — filed the proposal, wrote the refusal note, rendered the comparison — not described it in prose. Three defects let it lie in both directions.

False negatives (the dominant one). On a real insurance canonical run (c9115ded), the agent produced 16 OpenUI artifacts, 55 transcript proposals, 17 call-note proposals — and scored 0/15 personas. The matcher keyed on path+kind (artifacts) and title-only (proposals), blind to content; and a requirement worded as a deliverable shape ("Generated swap-comparison view persisted as a ui/** artifact") recalled only 3/7 tokens (0.429, below threshold) because generated/view/persisted/artifact are form words the JSON body never echoes.

False positives (the gameable seam). The two-stage check (structural → correctness) collapsed to one lexical gate: a title-only proposal (content=null) auto-passed correctness (correct=null → satisfied), and pairing the structural token-recall with the lexical token-recall correctness checker let a negation that contains the requirement's tokens ("I will NOT do the swap analysis") pass both stages.
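To make the polarity problem concrete, here is a minimal sketch of a lexical token-recall score. The tokenizer, the stopword list, and the `genuine` string are assumptions for illustration, not the code in agent-eval; only the negation sentence is quoted from the description above. It shows a refusal scoring the same recall as a real deliverable.

```typescript
// Hypothetical sketch: these names and lists are not taken from the agent-eval source.
const STOPWORDS = new Set(["a", "an", "the", "i", "do", "will", "to", "of"]);

// Lowercase, split on non-alphanumerics, drop stopwords and 1-char tokens.
function tokens(text: string): Set<string> {
  const out = new Set<string>();
  text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .forEach((t) => {
      if (t.length >= 2 && !STOPWORDS.has(t)) out.add(t);
    });
  return out;
}

// Fraction of requirement tokens that also appear in the candidate text.
function tokenRecall(requirement: string, candidate: string): number {
  const req = tokens(requirement);
  if (req.size === 0) return 0;
  const cand = tokens(candidate);
  let hits = 0;
  req.forEach((t) => {
    if (cand.has(t)) hits++;
  });
  return hits / req.size;
}

const requirement = "swap analysis";
const genuine = "Swap analysis: fund A replaces fund B and fees fall.";
const negation = "I will NOT do the swap analysis";

const genuineRecall = tokenRecall(requirement, genuine); // 1
const negationRecall = tokenRecall(requirement, negation); // 1
```

Both scores come out at 1, so any lexical threshold that admits the genuine note also admits the refusal. That is the reason the correctness stage has to be semantic.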

What

- Require an assessable body for a proposal to count (≥ MIN_CONTENT_CHARS). Tool calls remain the only legitimately content-less deliverable. Closes the title-only bypass.
- Strip deliverable-FORM vocabulary (generated, view, persisted, artifact, note, proposal, flag, …) from the requirement side of token recall, so a deliverable matches on what it is about (the domain nouns), not the boilerplate naming its form. Mirrors TITLE_STOPWORDS in the correctness checker.
- Document the polarity-blindness of createTokenRecallChecker and demote it to an opt-in structural pre-filter: produced-state correctness MUST be semantic (createLlmCorrectnessChecker). The structural stage is lexical and cannot, alone, catch negation.
- Anti-game fixtures pin all of it: title-only fails, thin-body fails, off-topic (even with body / even after form-strip) fails, and a token-complete negation is caught by the semantic checker while (the documented reason) sailing past the lexical one.
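The body requirement in the first item can be sketched as a simple filter. This is a hedged illustration: the `ProducedProposal` shape, the helper name `assessableProposals`, and the value 50 for `MIN_CONTENT_CHARS` (the figure the automated review reports) are assumptions, not the actual implementation.

```typescript
// Hypothetical shape; only the names ProducedProposal, content and
// MIN_CONTENT_CHARS appear in the PR text.
interface ProducedProposal {
  title: string;
  content?: string | null;
}

const MIN_CONTENT_CHARS = 50; // assumed value

// Keep only proposals whose trimmed body is long enough to correctness-check.
function assessableProposals(proposals: ProducedProposal[]): ProducedProposal[] {
  return proposals.filter(
    (p) => (p.content ?? "").trim().length >= MIN_CONTENT_CHARS,
  );
}

const kept = assessableProposals([
  { title: "Refusal note" }, // title-only: dropped
  { title: "Refusal note", content: "too short" }, // thin body: dropped
  { title: "Refusal note", content: " ".repeat(80) }, // whitespace only: dropped
  {
    title: "Refusal note",
    content:
      "Declined the swap because the replacement policy lapses coverage for 30 days.",
  }, // real body: kept
]);
```

Only the last proposal survives, so a bare title can no longer reach the correctness stage with nothing to check.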

Proof

Offline re-grade of the blind run c9115ded against the hardened oracle:

| persona | old | hardened | note |
| --- | --- | --- | --- |
| openui-swap-comparison | 0.00 | 1.00 | 16 OpenUI artifacts now credited |
| other 14 personas | 0.00 | 0.00 | no false un-blinding (anti-game holds) |

The other 14 stay at 0.00 correctly — their proposals were captured before the body-threading extractor (this branch's 6b144ae), so they carry no body to grade. The proposal-content path is unit-proven (the anti-game + routing fixtures use content-bearing proposals) but its live capture is validated by a fresh run, not this old artifact.

2275 tests green · typecheck clean · build clean.

Scope / rollout

This is an NS-1 faithfulness change to a fleet-shared substrate (insurance, legal, gtm, creative, agent-builder all consume verifyCompletion). It is stricter (require body, strip form words), so consumers should re-baseline before any fleet bump — scores that previously inflated on title-only / form-vocabulary matches will correct downward; genuine produced deliverables will newly count. Not published in this PR.

…ss-checks proposals

ProposalEventLike gains an optional content field; extractProducedState threads it into ProducedProposal.content (which verifyCompletion already reads). Without this, a producer that emitted a proposal body had it dropped, so the completion oracle could only grade proposals on presence + title match, never on whether the body is the right deliverable. Additive and backward-compatible: a proposal with no content is still graded presence-only.
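A minimal sketch of the threading this commit describes, under assumed shapes: only the names `ProposalEventLike`, `ProducedProposal`, and the `content` field come from the commit message, while the other fields and the helper `toProducedProposal` are hypothetical.

```typescript
// Hypothetical event and output shapes for illustration only.
interface ProposalEventLike {
  type: "proposal_created";
  title: string;
  content?: string; // the optional field the commit adds
}

interface ProducedProposal {
  title: string;
  content?: string;
}

// Copy the body through when the producer supplied one; otherwise emit the
// same title-only record as before, which keeps the change additive.
function toProducedProposal(event: ProposalEventLike): ProducedProposal {
  const proposal: ProducedProposal = { title: event.title };
  if (event.content !== undefined) proposal.content = event.content;
  return proposal;
}

const withBody = toProducedProposal({
  type: "proposal_created",
  title: "Refusal note",
  content: "Declined the swap: the replacement policy lapses coverage.",
});
const titleOnly = toProducedProposal({
  type: "proposal_created",
  title: "Refusal note",
});
```

An event without a body yields a record with no `content` key at all, so existing consumers see the same output they did before.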

…ess)

The completion oracle matched proposals on title only and artifacts on path+kind only, so it was content-blind. A correctly produced deliverable with a short label title or a generic path scored 0 even when its body fully covered the requirement (proven: an OpenUI comparison + orphan/refusal proposals scoring 0 on real runs). Include the proposal/artifact BODY in the token-recall match so the real deliverable is credited; MATCH_THRESHOLD + the requirement's distinctive tokens keep an off-topic item from matching (anti-game tests added).

NS-1 grader-fidelity work. Fleet-wide substrate change: needs cross-product regression on every consumer's eval before a fleet bump. Not yet published.

…contract

The two-stage completion oracle (content-aware structural match → correctness) is only ungameable when the correctness stage is SEMANTIC. The structural stage is lexical token recall, so a negation that contains the requirement's tokens matches it; pairing it with the lexical token-recall checker collapses to one gameable gate.

- proposalCandidates: require an assessable body (>= MIN_CONTENT_CHARS). A bare title carried no content to correctness-check, so it auto-passed the oracle (structurallyPresent && correct===null -> satisfied). Tool calls remain the only legitimately content-less deliverable.
- createTokenRecallChecker: documented polarity-blind; demoted to an opt-in structural pre-filter. Produced-state correctness MUST use the semantic checker.
- anti-game fixtures: title-only fails, thin-body fails, a token-complete negation is caught by the semantic checker and (the documented reason) sails past the lexical one.

2273 tests green.

…l recall

A requirement worded as a deliverable shape ('Generated swap-comparison view persisted as a ui/** artifact') recalled only 3/7 tokens (0.429, below the 0.5 match threshold) against a correct OpenUI artifact, because 'generated', 'view', 'persisted', 'artifact' are form words the JSON body never echoes. The discriminative tokens are the domain nouns (swap, comparison). Strip a small deliverable-FORM stopword set from the REQUIREMENT side of token recall so a deliverable matches on what it is ABOUT, not the boilerplate naming its form. Mirrors TITLE_STOPWORDS in the correctness checker.

Anti-game holds: an off-topic artifact still lacks the domain nouns (pinned by a new negative test).

Verified on the blind canonical run c9115ded: the openui persona goes 0.00 -> 1.00 (16 OpenUI artifacts credited) with zero false un-blinding on the other 14 personas. 2275 tests green.
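The arithmetic in this commit message can be reproduced with a small sketch. The tokenizer, the general stopword list, and both artifact fixtures are assumptions chosen so that the requirement tokenizes to seven tokens; the real tokenizer in agent-eval may count differently. Only the requirement string, the form words, and the 0.5 threshold are taken from the text above.

```typescript
// Hypothetical lists and tokenizer; not the agent-eval implementation.
const STOPWORDS = new Set(["a", "an", "as", "the", "of", "to"]);
const REQUIREMENT_FORM_STOPWORDS = new Set([
  "generated", "view", "persisted", "artifact", "note", "proposal", "flag",
]);
const MATCH_THRESHOLD = 0.5;

function tokenize(text: string): Set<string> {
  const out = new Set<string>();
  text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .forEach((t) => {
      if (t.length >= 2 && !STOPWORDS.has(t)) out.add(t);
    });
  return out;
}

// Recall of requirement tokens in the candidate; form words are optionally
// removed from the requirement side only.
function tokenRecall(requirement: string, candidate: string, stripForm: boolean): number {
  const req = tokenize(requirement);
  if (stripForm) REQUIREMENT_FORM_STOPWORDS.forEach((w) => req.delete(w));
  if (req.size === 0) return 0;
  const cand = tokenize(candidate);
  let hits = 0;
  req.forEach((t) => {
    if (cand.has(t)) hits++;
  });
  return hits / req.size;
}

const requirement = "Generated swap-comparison view persisted as a ui/** artifact";
const onTopic =
  "ui/comparison.json " + JSON.stringify({ type: "table", title: "Swap comparison", rows: [] });
const offTopic =
  "ui/summary.json " + JSON.stringify({ type: "table", title: "Premium summary", rows: [] });

const before = tokenRecall(requirement, onTopic, false); // 3/7 ≈ 0.429, below threshold
const after = tokenRecall(requirement, onTopic, true); // 3/3 = 1
const offTopicAfter = tokenRecall(requirement, offTopic, true); // 1/3 ≈ 0.333, still below
```

Stripping happens on the requirement side only, so the candidate cannot raise its own score by echoing form words, and the off-topic artifact still fails because it lacks the domain nouns.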

tangletools

✅ Auto-approved PR — `cfc4f3ce`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-14T02:14:05Z}

tangletools

🟢 Value Audit — sound


Verdict	sound
Concerns	0 (none)
Heuristic	0.0s
Duplication	0.0s
Interrogation	30.0s (2 bridge agents)
Total	30.0s

No concerns — sound change, no better or existing approach found. ✅

What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

| Pass | What it asks |
| --- | --- |
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |

Findings are concerns, not blocks — the human reviewer decides what to do with them.

_{value-audit · 20260614T024653Z}

tangletools · 2026-06-14T03:05:14Z

⚠️ Review Incomplete — `cfc4f3ce`

At least one required reviewer lane failed closed. No approval or request-changes review was published. Trigger a fresh review on the current PR head.

_{tangletools · 2026-06-14T03:05:12Z}

tangletools

🟢 Value Audit — sound


Verdict	sound
Concerns	0 (none)
Heuristic	0.0s
Duplication	0.0s
Interrogation	30.0s (2 bridge agents)
Total	30.0s

No concerns — sound change, no better or existing approach found. ✅

_{value-audit · 20260614T033818Z}

tangletools · 2026-06-14T03:49:40Z

✅ No Blockers — `cfc4f3ce`

Readiness 76/100 · Confidence 80/100 · 7 findings (1 medium, 6 low)

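| Metric | deepseek | glm | aggregate |
| --- | --- | --- | --- |
| Readiness | 76 | 86 | 76 |
| Confidence | 80 | 80 | 80 |
| Correctness | 76 | 86 | 76 |
| Security | 76 | 86 | 76 |
| Testing | 76 | 86 | 76 |
| Architecture | 76 | 86 | 76 |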

Full multi-shot audit completed 4/4 planned shots over 4 changed files. Global verifier still owns final merge decision.

🟠 MEDIUM 'view' in REQUIREMENT_FORM_STOPWORDS may over-strip domain-significant tokens — src/completion-verifier.ts

Line 158: 'view' is included in the deliverable-FORM stopword set. In UI-generation contexts this is correct (e.g. 'Generated swap-comparison view' -> 'swap-comparison' are the domain tokens). However 'view' also appears as a genuine domain noun in requirements like 'Customer Lifecycle View analysis' where stripping it degrades recall. The risk is partially mitigated because stripping is requirement-side only and other domain tokens usually survive, but a requirement consisting solely of ['view', fieldName] could drop below MATCH_THRESHOLD. Consider whether 'view' should be context-gated or documented as a known trade-off.

🟡 LOW Boundary conditions for MATCH_THRESHOLD and MIN_CONTENT_CHARS untested — src/completion-verifier.test.ts

No test exercises exactly MIN_CONTENT_CHARS=50 chars (thin-body test uses 9 chars; approved proposals use ~175 chars) or exactly MATCH_THRESHOLD=0.5 recall. The >= 0.5 vs > 0.5 decision at artifactCandidates:234 determines whether edge-case items match. Not a bug — all production scenarios comfortably clear the thresholds — but a change to strict-inequality (> 0.5) could silently break matching without a test to catch it.
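One way to pin the boundaries this finding describes is a small table of edge cases. The sketch below is hypothetical: the two predicates are stand-ins for the real guards, and the constants use the values quoted in the finding (50 and 0.5).

```typescript
// Stand-in predicates; the real checks live in completion-verifier.ts.
const MIN_CONTENT_CHARS = 50;
const MATCH_THRESHOLD = 0.5;

const hasAssessableBody = (content: string | null | undefined): boolean =>
  (content ?? "").trim().length >= MIN_CONTENT_CHARS;

const meetsThreshold = (recall: number): boolean => recall >= MATCH_THRESHOLD;

// Each case sits exactly on, or one step below, a boundary.
const boundaryCases = [
  { name: "body of exactly 50 chars passes", actual: hasAssessableBody("x".repeat(50)), expected: true },
  { name: "body of 49 chars fails", actual: hasAssessableBody("x".repeat(49)), expected: false },
  { name: "padding does not count toward length", actual: hasAssessableBody("  " + "x".repeat(49) + "  "), expected: false },
  { name: "recall of exactly 0.5 matches", actual: meetsThreshold(0.5), expected: true },
  { name: "recall of 0.49 does not match", actual: meetsThreshold(0.49), expected: false },
];

const failures = boundaryCases
  .filter((c) => c.actual !== c.expected)
  .map((c) => c.name);
```

If a later change switched `>=` to `>` in either predicate, the corresponding "exactly" case would show up in `failures`.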

🟡 LOW Non-discriminative token 'as' survives in form-vocabulary test requirement set — src/completion-verifier.test.ts

In the test at line 259-288, the requirement 'Generated swap-comparison view persisted as a ui/** artifact' tokenizes to {swap, comparison, as, ui} after form-stopword stripping. The token 'as' (length 2, not in STOPWORDS or REQUIREMENT_FORM_STOPWORDS) is non-discriminative noise. It does not cause a test failure because the anti-game test at line 290-316 confirms the off-topic artifact still fails recall. This is a pre-existing characteristic of the token function, not introduced by this PR, but the test fixture would be

🟡 LOW Requirements titled entirely with form vocabulary silently match nothing — src/completion-verifier.ts

Lines 195-201: tokenRecall strips REQUIREMENT_FORM_STOPWORDS from the requirement side, then returns 0 when req.size === 0. A requirement title like 'Create a file' or 'Generated output' would have ALL tokens stripped, leaving an empty set → score 0 → no candidate can ever match. This is an inherent design trade-off (such requirements are poorly specified — they should name domain content, not form), and the motivating real-world case ('Generated swap-comparison view persisted as a ui/** artifact') still has 'swap' and 'comparison' as distinctive tokens. But a consumer writing a thin requirement title could see a silent zero with no diagnostic. Consider: when r
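One possible shape for such a diagnostic is sketched here. It is an assumption, not a proposal taken from the review: the result type, the function name, the tokenizer, and the membership of the form-word set (in particular `output`) are all hypothetical.

```typescript
// Hypothetical form-word set and result type for illustration.
const REQUIREMENT_FORM_STOPWORDS = new Set([
  "generated", "output", "view", "persisted", "artifact",
]);

type RecallResult = { score: number; diagnostic?: string };

function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((t) => t.length >= 2);
}

// Same recall as before, but an all-form-word requirement reports why it
// can never match instead of returning a bare zero.
function tokenRecallWithDiagnostic(requirement: string, candidate: string): RecallResult {
  const req = Array.from(new Set(tokenize(requirement))).filter(
    (t) => !REQUIREMENT_FORM_STOPWORDS.has(t),
  );
  if (req.length === 0) {
    return {
      score: 0,
      diagnostic: `requirement "${requirement}" has no domain tokens after form-word stripping`,
    };
  }
  const cand = new Set(tokenize(candidate));
  const hits = req.filter((t) => cand.has(t)).length;
  return { score: hits / req.length };
}

const thin = tokenRecallWithDiagnostic("Generated output", "swap comparison table");
const normal = tokenRecallWithDiagnostic("Generated swap-comparison view", "swap comparison table");
```

Here `thin` carries a diagnostic with score 0, while `normal` scores 1 with no diagnostic, so a consumer can tell an under-specified requirement apart from a genuine miss.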

🟡 LOW proposalCandidates body guard uses post-trim length for skip but pre-trim for matching — src/completion-verifier.ts

Lines 261-262: The guard if (body.length < MIN_CONTENT_CHARS) continue uses the trimmed body length, which is correct for the gate. But on line 271, ${p.title} ${body} uses the trimmed body — the tokenRecall sees the trimmed body. This is consistent but worth noting: a content string like ' ... 50+ spaces ... ' would be trimmed to empty, skip the guard, and body would be empty in the match text. Since the guard fires first, this is safe in practice, but the control flow would be marginally clearer if the early-return were `if ((p

🟡 LOW Mixed-stream E2E test does not exercise the content path — src/produced-state.test.ts

The 'normalizes a realistic mixed stream end to end' test (line 121-144) includes a proposal_created event without content, so the new content-threading branch is not covered in the integration-style test. The two new unit tests cover the contract directly, so this is coverage completeness only, not a correctness gap.

🟡 LOW content null-leak via structural typing — src/produced-state.ts

Line 107: p.content !== undefined passes null through to output { content: null }, mismatching ProducedProposal.content?: string. The RuntimeEventLike catch-all { type: string } permits structurally-typed input with content: null. Downstream proposalCandidates (completion-verifier.ts:261) mitigates via (p.content ?? '').trim(), so no crash or incorrect verdict. Consider p.content != null or extracting through a content?: string type-guard. Consistent with the rest of the file's structural typing (artifact path uses ?? ''), so low-priority.
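The difference between the two guards can be shown in a few lines. The event and proposal types below are hypothetical stand-ins for the structurally-typed input the finding describes; only the `!== undefined` and `!= null` comparisons come from the finding itself.

```typescript
// Hypothetical loose input type: content may arrive as null.
type LooseProposalEvent = { type: string; title: string; content?: string | null };
type Proposal = { title: string; content?: string };

// `!== undefined` lets null through, so the output can carry content: null.
function threadLoose(e: LooseProposalEvent): { title: string; content?: string | null } {
  return e.content !== undefined
    ? { title: e.title, content: e.content }
    : { title: e.title };
}

// `!= null` rejects both null and undefined, so the output matches Proposal.
function threadStrict(e: LooseProposalEvent): Proposal {
  return e.content != null
    ? { title: e.title, content: e.content }
    : { title: e.title };
}

const nullEvent: LooseProposalEvent = {
  type: "proposal_created",
  title: "Orphan flag",
  content: null,
};

const leaked = threadLoose(nullEvent); // { title: "Orphan flag", content: null }
const clean = threadStrict(nullEvent); // { title: "Orphan flag" }
```

With the loose equality check the compiler narrows `e.content` to `string`, so the stricter function type-checks against `Proposal` without a cast.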

_{tangletools · 2026-06-14T03:49:38Z · trace}

…e map (#250)

Releases the merged-but-unpublished work:

- content-aware + ungameable produced-state completion grader (#248): require a body for a structural proposal match, strip deliverable-FORM vocabulary from requirement-side recall, anti-game fixtures.
- eval surface map doc (#249): pick-by-use-when for run* + the in-band produced-state contract; retires the phantom persona-dispatch wrapper.

Stricter verifyCompletion behavior: a notable minor; consumers adopt by bump.

drewstone added 4 commits June 13, 2026 15:01

tangletools approved these changes Jun 14, 2026

tangletools reviewed Jun 14, 2026

drewstone merged commit a10c25e into main Jun 14, 2026
1 check passed

drewstone mentioned this pull request Jun 14, 2026

chore(release): agent-eval 0.92.0 — faithful grader + surface map #250

Merged

drewstone merged 4 commits into main from feat/produced-state-proposal-content
