fix(produced-state): faithful completion oracle — content-aware, ungameable, form-vocabulary-robust#248
…ss-checks proposals ProposalEventLike gains an optional content field; extractProducedState threads it into ProducedProposal.content (which verifyCompletion already reads). Without this a producer that emits a proposal body had it dropped, so the completion oracle could only grade proposals on presence + title match — never on whether the body is the right deliverable. Additive and backward-compatible: a proposal with no content is still graded presence-only.
…ess) The completion oracle matched proposals on title-only and artifacts on path+kind-only — content-blind. A correctly-produced deliverable with a short label title or a generic path scored 0 even when its body fully covered the requirement (proven: an OpenUI comparison + orphan/refusal proposals scoring 0 on real runs). Include the proposal/artifact BODY in the token-recall match so the real deliverable is credited; MATCH_THRESHOLD + the requirement's distinctive tokens keep an off-topic item from matching (anti-game tests added). NS-1 grader-fidelity work — fleet-wide substrate change: needs cross-product regression on every consumer's eval before a fleet bump. Not yet published.
…contract The two-stage completion oracle (content-aware structural match → correctness) is only ungameable when the correctness stage is SEMANTIC. The structural stage is lexical token recall, so a negation that contains the requirement's tokens matches it; pairing it with the lexical token-recall checker collapses to one gameable gate. - proposalCandidates: require an assessable body (>= MIN_CONTENT_CHARS). A bare title carried no content to correctness-check, so it auto-passed the oracle (structurallyPresent && correct===null -> satisfied). Tool calls remain the only legitimately content-less deliverable. - createTokenRecallChecker: documented polarity-blind; demoted to an opt-in structural pre-filter. Produced-state correctness MUST use the semantic checker. - anti-game fixtures: title-only fails, thin-body fails, a token-complete negation is caught by the semantic checker and (the documented reason) sails past the lexical one. 2273 tests green.
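The title-only bypass described in this commit message can be illustrated with a small sketch. Only the verdict rule (`structurallyPresent && correct === null -> satisfied`) and the `>= MIN_CONTENT_CHARS` body gate come from the message; the `Proposal` shape, the `Checker` type, and both `verdict*` functions are hypothetical names, and tool calls (the one legitimately content-less deliverable) are left out.

```typescript
// Hypothetical types and names; not the real code in src/completion-verifier.ts.
const MIN_CONTENT_CHARS = 50; // value quoted in the review of this PR

interface Proposal {
  title: string;
  content?: string | null;
}

// Stand-in for a correctness checker (semantic in production, per the commit message).
type Checker = (body: string) => boolean;

// Old rule: with no body there was nothing to check, and a null result did not block.
function verdictBefore(structurallyPresent: boolean, p: Proposal, check: Checker): boolean {
  const correct = p.content ? check(p.content) : null;
  return structurallyPresent && (correct === null || correct);
}

// New rule: a proposal without an assessable body is never a candidate.
function verdictAfter(structurallyPresent: boolean, p: Proposal, check: Checker): boolean {
  const body = (p.content ?? "").trim();
  if (body.length < MIN_CONTENT_CHARS) return false;
  return structurallyPresent && check(body);
}

const titleOnly: Proposal = { title: "Swap analysis" };
const rejectAll: Checker = () => false;

console.log(verdictBefore(true, titleOnly, rejectAll)); // title-only used to pass
console.log(verdictAfter(true, titleOnly, rejectAll)); // now rejected before correctness runs
```

Under these assumptions the old rule accepts a bare title even when the checker would reject everything, because the checker is never invoked.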
…l recall A requirement worded as a deliverable shape — 'Generated swap-comparison view persisted as a ui/** artifact' — recalled only 3/7 tokens (0.429, below the 0.5 match threshold) against a correct OpenUI artifact, because 'generated', 'view', 'persisted', 'artifact' are form words the JSON body never echoes. The discriminative tokens are the domain nouns (swap, comparison). Strip a small deliverable-FORM stopword set from the REQUIREMENT side of token recall so a deliverable matches on what it is ABOUT, not the boilerplate naming its form. Mirrors TITLE_STOPWORDS in the correctness checker. Anti-game holds: an off-topic artifact still lacks the domain nouns (pinned by a new negative test). Verified on the blind canonical run c9115ded: the openui persona goes 0.00 -> 1.00 (16 OpenUI artifacts credited) with zero false un-blinding on the other 14 personas. 2275 tests green.
tangletools
left a comment
✅ Auto-approved PR — cfc4f3ce
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-14T02:14:05Z
tangletools
left a comment
🟢 Value Audit — sound
| Verdict | sound |
| Concerns | 0 (none) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 30.0s (2 bridge agents) |
| Total | 30.0s |
No concerns — sound change, no better or existing approach found. ✅
What this audit checks
It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.
| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |
Findings are concerns, not blocks — the human reviewer decides what to do with them.
tangletools
left a comment
✅ No Blockers —
| | deepseek | glm | aggregate |
|---|---|---|---|
| Readiness | 76 | 86 | 76 |
| Confidence | 80 | 80 | 80 |
| Correctness | 76 | 86 | 76 |
| Security | 76 | 86 | 76 |
| Testing | 76 | 86 | 76 |
| Architecture | 76 | 86 | 76 |
Full multi-shot audit completed 4/4 planned shots over 4 changed files. Global verifier still owns final merge decision.
🟠 MEDIUM 'view' in REQUIREMENT_FORM_STOPWORDS may over-strip domain-significant tokens — src/completion-verifier.ts
Line 158: 'view' is included in the deliverable-FORM stopword set. In UI-generation contexts this is correct (e.g. 'Generated swap-comparison view' -> 'swap-comparison' are the domain tokens). However 'view' also appears as a genuine domain noun in requirements like 'Customer Lifecycle View analysis' where stripping it degrades recall. The risk is partially mitigated because stripping is requirement-side only and other domain tokens usually survive, but a requirement consisting solely of ['view', fieldName] could drop below MATCH_THRESHOLD. Consider whether 'view' should be context-gated or documented as a known trade-off.
🟡 LOW Boundary conditions for MATCH_THRESHOLD and MIN_CONTENT_CHARS untested — src/completion-verifier.test.ts
No test exercises exactly MIN_CONTENT_CHARS=50 chars (thin-body test uses 9 chars; approved proposals use ~175 chars) or exactly MATCH_THRESHOLD=0.5 recall. The >= 0.5 vs > 0.5 decision at artifactCandidates:234 determines whether edge-case items match. Not a bug — all production scenarios comfortably clear the thresholds — but a change to strict-inequality (> 0.5) could silently break matching without a test to catch it.
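A boundary test along the following lines would pin both inclusive comparisons. This is only a sketch: the two constant values are the ones quoted in the finding, while `meetsThreshold`, `hasAssessableBody`, and `repeatChar` are hypothetical stand-ins for the real logic in `artifactCandidates` and `proposalCandidates`.

```typescript
// Values quoted in the finding above; helper names are hypothetical.
const MATCH_THRESHOLD = 0.5;
const MIN_CONTENT_CHARS = 50;

// Inclusive comparison: a candidate recalling exactly half the tokens matches.
function meetsThreshold(recall: number): boolean {
  return recall >= MATCH_THRESHOLD;
}

// Assumed body gate: the trimmed length must reach MIN_CONTENT_CHARS.
function hasAssessableBody(content: string | null | undefined): boolean {
  return (content ?? "").trim().length >= MIN_CONTENT_CHARS;
}

// Builds a string of n copies of ch.
function repeatChar(ch: string, n: number): string {
  return new Array(n + 1).join(ch);
}

const boundaryCases = [
  { name: "recall exactly at threshold", actual: meetsThreshold(0.5), expected: true },
  { name: "recall just below threshold", actual: meetsThreshold(0.49), expected: false },
  { name: "body of exactly 50 chars", actual: hasAssessableBody(repeatChar("x", 50)), expected: true },
  { name: "body of 49 chars", actual: hasAssessableBody(repeatChar("x", 49)), expected: false },
  { name: "50 spaces trim to empty", actual: hasAssessableBody(repeatChar(" ", 50)), expected: false },
];

boundaryCases.forEach((c) => {
  console.log(c.name, c.actual === c.expected ? "ok" : "FAIL");
});
```

If either comparison were changed to a strict inequality, the first or third case would flip, which is the regression the finding is worried about.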
🟡 LOW Non-discriminative token 'as' survives in form-vocabulary test requirement set — src/completion-verifier.test.ts
In the test at line 259-288, the requirement 'Generated swap-comparison view persisted as a ui/** artifact' tokenizes to {swap, comparison, as, ui} after form-stopword stripping. The token 'as' (length 2, not in STOPWORDS or REQUIREMENT_FORM_STOPWORDS) is non-discriminative noise. It does not cause a test failure because the anti-game test at line 290-316 confirms the off-topic artifact still fails recall. This is a pre-existing characteristic of the token function, not introduced by this PR, but the test fixture would be
🟡 LOW Requirements titled entirely with form vocabulary silently match nothing — src/completion-verifier.ts
Lines 195-201: tokenRecall strips REQUIREMENT_FORM_STOPWORDS from the requirement side, then returns 0 when req.size === 0. A requirement title like 'Create a file' or 'Generated output' would have ALL tokens stripped, leaving an empty set → score 0 → no candidate can ever match. This is an inherent design trade-off (such requirements are poorly specified — they should name domain content, not form), and the motivating real-world case ('Generated swap-comparison view persisted as a ui/** artifact') still has 'swap' and 'comparison' as distinctive tokens. But a consumer writing a thin requirement title could see a silent zero with no diagnostic. Consider: when r
🟡 LOW proposalCandidates body guard uses post-trim length for skip but pre-trim for matching — src/completion-verifier.ts
Lines 261-262: The guard `if (body.length < MIN_CONTENT_CHARS) continue` uses the trimmed body length, which is correct for the gate. But on line 271, `${p.title} ${body}` uses the trimmed body, so the tokenRecall sees the trimmed body. This is consistent but worth noting: a content string like ' ... 50+ spaces ... ' would be trimmed to empty, skip the guard, and `body` would be empty in the match text. Since the guard fires first, this is safe in practice, but the control flow would be marginally clearer if the early-return were `if ((p
🟡 LOW Mixed-stream E2E test does not exercise the content path — src/produced-state.test.ts
The 'normalizes a realistic mixed stream end to end' test (line 121-144) includes a proposal_created event without content, so the new content-threading branch is not covered in the integration-style test. The two new unit tests cover the contract directly, so this is coverage completeness only, not a correctness gap.
🟡 LOW content null-leak via structural typing — src/produced-state.ts
Line 107: `p.content !== undefined` passes `null` through to output `{ content: null }`, mismatching `ProducedProposal.content?: string`. The `RuntimeEventLike` catch-all `{ type: string }` permits structurally-typed input with `content: null`. Downstream `proposalCandidates` (completion-verifier.ts:261) mitigates via `(p.content ?? '').trim()`, so no crash or incorrect verdict. Consider `p.content != null` or extracting through a `content?: string` type-guard. Consistent with the rest of the file's structural typing (artifact path uses `?? ''`), so low-priority.
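The suggested `!= null` guard might look like the sketch below. The interfaces are simplified, hypothetical versions of `ProposalEventLike` and `ProducedProposal` (the input type is widened to admit `null` so the leak can be modelled), and `toProducedProposal` is an illustrative name, not the function in `src/produced-state.ts`.

```typescript
// Hypothetical, simplified shapes; the real definitions live in src/produced-state.ts.
interface ProposalEventLike {
  title: string;
  content?: string | null; // widened to model structurally-typed input carrying null
}

interface ProducedProposal {
  title: string;
  content?: string;
}

function toProducedProposal(p: ProposalEventLike): ProducedProposal {
  const out: ProducedProposal = { title: p.title };
  // Loose inequality rejects both null and undefined, so null never reaches the output.
  if (p.content != null) out.content = p.content;
  return out;
}

console.log(toProducedProposal({ title: "Refusal note", content: null }));
console.log(toProducedProposal({ title: "Refusal note", content: "Declined because the swap is unsuitable." }));
```

With this guard the output type holds without relying on the downstream `?? ''` fallback.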
tangletools · 2026-06-14T03:49:38Z · trace
…e map Releases the merged-but-unpublished work: - content-aware + ungameable produced-state completion grader (#248): require a body for a structural proposal match, strip deliverable-FORM vocabulary from requirement-side recall, anti-game fixtures. - eval surface map doc (#249): pick-by-use-when for run* + the in-band produced-state contract; retires the phantom persona-dispatch wrapper. Stricter verifyCompletion behavior — a notable minor; consumers adopt by bump.
…e map (#250) Releases the merged-but-unpublished work: - content-aware + ungameable produced-state completion grader (#248): require a body for a structural proposal match, strip deliverable-FORM vocabulary from requirement-side recall, anti-game fixtures. - eval surface map doc (#249): pick-by-use-when for run* + the in-band produced-state contract; retires the phantom persona-dispatch wrapper. Stricter verifyCompletion behavior — a notable minor; consumers adopt by bump.
Why
The produced-state completion oracle (`verifyCompletion`) is the eval's anti-proxy guarantee: a score only means something if it confirms the agent actually did the regulated work (filed the proposal, wrote the refusal note, rendered the comparison) rather than described it in prose. Three defects let it lie in both directions.
False negatives (the dominant one). On a real insurance canonical run (`c9115ded`), the agent produced 16 OpenUI artifacts, 55 transcript proposals, and 17 call-note proposals, and scored 0/15 personas. The matcher keyed on path+kind (artifacts) and title-only (proposals), blind to content; and a requirement worded as a deliverable shape ("Generated swap-comparison view persisted as a ui/** artifact") recalled only 3/7 tokens (0.429, below threshold) because `generated`/`view`/`persisted`/`artifact` are form words the JSON body never echoes.
False positives (the gameable seam). The two-stage check (structural → correctness) collapsed to one lexical gate: a title-only proposal (`content=null`) auto-passed correctness (`correct=null` → satisfied), and pairing the structural token-recall with the lexical token-recall correctness checker let a negation that contains the requirement's tokens ("I will NOT do the swap analysis") pass both stages.
What
- Require an assessable proposal body (>= `MIN_CONTENT_CHARS`). Tool calls remain the only legitimately content-less deliverable. Closes the title-only bypass.
- Strip deliverable-form vocabulary (`generated`, `view`, `persisted`, `artifact`, `note`, `proposal`, `flag`, …) from the requirement side of token recall, so a deliverable matches on what it is about (the domain nouns), not the boilerplate naming its form. Mirrors `TITLE_STOPWORDS` in the correctness checker.
- Document the polarity-blindness of `createTokenRecallChecker` and demote it to an opt-in structural pre-filter: produced-state correctness MUST be semantic (`createLlmCorrectnessChecker`). The structural stage is lexical and cannot, alone, catch negation.
Proof
Offline re-grade of the blind run `c9115ded` against the hardened oracle: the openui persona goes 0.00 -> 1.00 (16 OpenUI artifacts credited).
The other 14 stay at 0.00 correctly: their proposals were captured before the body-threading extractor (this branch's `6b144ae`), so they carry no body to grade. The proposal-content path is unit-proven (the anti-game and routing fixtures use content-bearing proposals), but its live capture has to be validated by a fresh run, not by this old artifact.
2275 tests green · typecheck clean · build clean.
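The requirement-side stripping described under What can be sketched as follows. This is a minimal illustration, not the code in `src/completion-verifier.ts`: the tokenizer, the `FORM_STOPWORDS` subset, and the two JSON bodies are assumptions, so the unstripped recall here comes out as 3/8 rather than the 3/7 measured on the real run.

```typescript
const MATCH_THRESHOLD = 0.5; // value quoted in the review

// Assumed subset of the deliverable-form vocabulary listed in the description.
const FORM_STOPWORDS = ["generated", "view", "persisted", "artifact", "note", "proposal", "flag"];

// Assumed tokenizer: lowercase, split on non-alphanumerics, drop single characters, dedupe.
function tokenize(text: string): string[] {
  const seen: string[] = [];
  text.toLowerCase().split(/[^a-z0-9]+/).forEach((t) => {
    if (t.length > 1 && seen.indexOf(t) === -1) seen.push(t);
  });
  return seen;
}

// Fraction of requirement tokens found in the candidate text.
function tokenRecall(requirement: string, candidate: string, stripForm: boolean): number {
  let req = tokenize(requirement);
  if (stripForm) req = req.filter((t) => FORM_STOPWORDS.indexOf(t) === -1);
  if (req.length === 0) return 0; // a form-only requirement can match nothing
  const cand = tokenize(candidate);
  const hits = req.filter((t) => cand.indexOf(t) !== -1).length;
  return hits / req.length;
}

const requirement = "Generated swap-comparison view persisted as a ui/** artifact";
// Hypothetical artifact bodies, invented for illustration.
const onTopic = '{"component":"SwapComparison","path":"ui/swap-comparison.json","rows":[]}';
const offTopic = '{"component":"PremiumTable","path":"ui/premium-table.json","rows":[]}';

console.log(tokenRecall(requirement, onTopic, false) >= MATCH_THRESHOLD); // false: form words dilute recall
console.log(tokenRecall(requirement, onTopic, true) >= MATCH_THRESHOLD); // true: matches on swap, comparison, ui
console.log(tokenRecall(requirement, offTopic, true) >= MATCH_THRESHOLD); // false: domain nouns are missing
```

Stripping only the requirement side keeps the anti-game property in this sketch: an off-topic body gains nothing, because the tokens that remain are the domain nouns it lacks.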
Scope / rollout
This is an NS-1 faithfulness change to a fleet-shared substrate (insurance, legal, gtm, creative, agent-builder all consume `verifyCompletion`). It is stricter (require body, strip form words), so consumers should re-baseline before any fleet bump: scores that previously inflated on title-only / form-vocabulary matches will correct downward; genuine produced deliverables will newly count. Not published in this PR.