fix(produced-state): faithful completion oracle — content-aware, ungameable, form-vocabulary-robust#248

Merged
drewstone merged 4 commits into main from feat/produced-state-proposal-content on Jun 14, 2026

@drewstone

Why

The produced-state completion oracle (verifyCompletion) is the eval's anti-proxy guarantee: a score only means something if it confirms the agent actually did the regulated work — filed the proposal, wrote the refusal note, rendered the comparison — not described it in prose. Three defects let it lie in both directions.

False negatives (the dominant one). On a real insurance canonical run (c9115ded), the agent produced 16 OpenUI artifacts, 55 transcript proposals, 17 call-note proposals — and scored 0/15 personas. The matcher keyed on path+kind (artifacts) and title-only (proposals), blind to content; and a requirement worded as a deliverable shape ("Generated swap-comparison view persisted as a ui/** artifact") recalled only 3/7 tokens (0.429, below threshold) because generated/view/persisted/artifact are form words the JSON body never echoes.
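The 3/7 figure can be reproduced with a small sketch. The tokenizer, the stopword list, and the artifact text below are assumptions chosen so that the requirement yields seven tokens; the real tokenizer in the verifier may differ.

```typescript
// Hypothetical sketch: tokenizer and stopwords are assumptions, not the repo's code.
const STOPWORDS = new Set(["a", "an", "as", "the", "of", "to"]);

function tokenize(text: string): string[] {
  const words: string[] = text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  // De-duplicate and drop generic stopwords.
  return Array.from(new Set(words.filter((w) => !STOPWORDS.has(w))));
}

// Fraction of requirement tokens that also occur in the candidate text.
function tokenRecall(requirement: string, candidate: string): number {
  const req = tokenize(requirement);
  if (req.length === 0) return 0;
  const cand = new Set(tokenize(candidate));
  return req.filter((t) => cand.has(t)).length / req.length;
}

const requirement = "Generated swap-comparison view persisted as a ui/** artifact";
// Made-up artifact path plus JSON body, standing in for a real OpenUI artifact.
const artifactText =
  'ui/swap-comparison.json {"type":"comparison","rows":[{"swap":"Policy A to Policy B"}]}';

// Requirement tokens: generated, swap, comparison, view, persisted, ui, artifact (7).
// Only swap, comparison and ui occur in the artifact text, so recall is 3/7.
const recall = tokenRecall(requirement, artifactText);
console.log(recall.toFixed(3)); // 0.429
```

The four missed tokens (generated, view, persisted, artifact) describe the form of the deliverable, which is why a correct artifact still lands under the threshold.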

False positives (the gameable seam). The two-stage check (structural → correctness) collapsed to one lexical gate: a title-only proposal (content=null) auto-passed correctness (correct=null → satisfied), and pairing the structural token-recall with the lexical token-recall correctness checker let a negation that contains the requirement's tokens ("I will NOT do the swap analysis") pass both stages.
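Both bypasses can be illustrated with a minimal sketch. The function names here are illustrative stand-ins, not the verifier's API; only the rule "structurally present and correct is null counts as satisfied" and the negation sentence come from this PR.

```typescript
// Hypothetical sketch; helper names are illustrative, not the repo's API.
type Correctness = boolean | null; // null means "no content to assess"

// Pre-fix rule as described above: a null correctness result counted as satisfied.
function satisfiedBefore(structurallyPresent: boolean, correct: Correctness): boolean {
  return structurallyPresent && correct !== false;
}

// Lexical recall: fraction of requirement words that occur in the text.
function lexicalRecall(requirement: string, text: string): number {
  const words = (s: string): string[] => s.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  const req = Array.from(new Set(words(requirement)));
  if (req.length === 0) return 0;
  const seen = new Set(words(text));
  return req.filter((w) => seen.has(w)).length / req.length;
}

// Bypass 1: a title-only proposal is structurally present with nothing to check.
const titleOnlyPasses = satisfiedBefore(true, null);

// Bypass 2: the negation echoes every requirement token, so a lexical
// checker sees full recall and cannot tell refusal from completion.
const negationRecall = lexicalRecall("swap analysis", "I will NOT do the swap analysis");
```

Because the same lexical measure would drive both stages, the second stage adds no information. That is the sense in which the two stages collapse to one gate.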

What

  • Require an assessable body for a proposal to count (≥ MIN_CONTENT_CHARS). Tool calls remain the only legitimately content-less deliverable. Closes the title-only bypass.
  • Strip deliverable-FORM vocabulary (generated, view, persisted, artifact, note, proposal, flag, …) from the requirement side of token recall, so a deliverable matches on what it is about (the domain nouns) not the boilerplate naming its form. Mirrors TITLE_STOPWORDS in the correctness checker.
  • Document the polarity-blindness of createTokenRecallChecker and demote it to an opt-in structural pre-filter: produced-state correctness MUST be semantic (createLlmCorrectnessChecker). The structural stage is lexical and cannot, alone, catch negation.
  • Anti-game fixtures pin all of it: title-only fails, thin-body fails, off-topic (even with body / even after form-strip) fails, and a token-complete negation is caught by the semantic checker — and (the documented reason) sails past the lexical one.
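The body requirement in the first bullet can be sketched as a filter over proposals. This is an illustration under assumptions: the interface shape and function name are hypothetical, and the value 50 for MIN_CONTENT_CHARS is the one quoted in the review findings on this PR.

```typescript
// Hypothetical sketch of the "assessable body" gate; not the repo's implementation.
const MIN_CONTENT_CHARS = 50; // value quoted in the review findings

interface ProposalLike {
  title: string;
  content?: string | null;
}

interface AssessableProposal {
  title: string;
  body: string;
}

// Keep only proposals whose trimmed body is long enough to be correctness-checked.
function assessableProposals(proposals: ProposalLike[]): AssessableProposal[] {
  const out: AssessableProposal[] = [];
  for (const p of proposals) {
    const body = (p.content ?? "").trim();
    if (body.length < MIN_CONTENT_CHARS) continue; // title-only or thin body
    out.push({ title: p.title, body });
  }
  return out;
}

const kept = assessableProposals([
  { title: "Swap comparison" }, // title-only
  { title: "Swap comparison", content: "see above" }, // thin body
  { title: "Swap comparison", content: "   " }, // whitespace only
  {
    title: "Swap comparison",
    content: "Compared policy A against policy B on premium, term, and surrender charges.",
  },
]);
console.log(kept.length); // 1
```

Only the last proposal survives, so only it can reach the correctness stage.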

Proof

Offline re-grade of the blind run c9115ded against the hardened oracle:

| persona | old | hardened | note |
|---|---|---|---|
| openui-swap-comparison | 0.00 | 1.00 | 16 OpenUI artifacts now credited |
| other 14 personas | 0.00 | 0.00 | no false un-blinding (anti-game holds) |

The other 14 stay at 0.00 correctly — their proposals were captured before the body-threading extractor (this branch's 6b144ae), so they carry no body to grade. The proposal-content path is unit-proven (the anti-game + routing fixtures use content-bearing proposals) but its live capture is validated by a fresh run, not this old artifact.

2275 tests green · typecheck clean · build clean.

Scope / rollout

This is an NS-1 faithfulness change to a fleet-shared substrate (insurance, legal, gtm, creative, agent-builder all consume verifyCompletion). It is stricter (require body, strip form words), so consumers should re-baseline before any fleet bump — scores that previously inflated on title-only / form-vocabulary matches will correct downward; genuine produced deliverables will newly count. Not published in this PR.

…ss-checks proposals

ProposalEventLike gains an optional content field; extractProducedState threads
it into ProducedProposal.content (which verifyCompletion already reads). Without
this a producer that emits a proposal body had it dropped, so the completion
oracle could only grade proposals on presence + title match — never on whether
the body is the right deliverable. Additive and backward-compatible: a proposal
with no content is still graded presence-only.
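A minimal sketch of the threading, assuming simplified shapes for the two types: the `title` field and the helper name `toProducedProposal` are hypothetical, and the real `extractProducedState` handles more event kinds than shown here.

```typescript
// Hypothetical, simplified shapes; only the optional `content` field is from this PR.
interface ProposalEventLike {
  type: "proposal_created";
  title: string;
  content?: string;
}

interface ProducedProposal {
  title: string;
  content?: string;
}

// Carry the body through when the producer supplied one; otherwise emit the
// same title-only shape as before, which keeps the change additive.
function toProducedProposal(e: ProposalEventLike): ProducedProposal {
  return e.content !== undefined
    ? { title: e.title, content: e.content }
    : { title: e.title };
}

const withBody = toProducedProposal({
  type: "proposal_created",
  title: "Refusal note",
  content: "Declined the replacement because the surrender charge outweighs the benefit.",
});
const titleOnly = toProducedProposal({ type: "proposal_created", title: "Refusal note" });
```

Here `withBody.content` is available to the completion oracle, while `titleOnly` has no `content` key at all.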
…ess)

The completion oracle matched proposals on title-only and artifacts on
path+kind-only — content-blind. A correctly-produced deliverable with a short
label title or a generic path scored 0 even when its body fully covered the
requirement (proven: an OpenUI comparison + orphan/refusal proposals scoring 0
on real runs). Include the proposal/artifact BODY in the token-recall match so
the real deliverable is credited; MATCH_THRESHOLD + the requirement's
distinctive tokens keep an off-topic item from matching (anti-game tests added).

NS-1 grader-fidelity work — fleet-wide substrate change: needs cross-product
regression on every consumer's eval before a fleet bump. Not yet published.
…contract

The two-stage completion oracle (content-aware structural match → correctness)
is only ungameable when the correctness stage is SEMANTIC. The structural stage
is lexical token recall, so a negation that contains the requirement's tokens
matches it; pairing it with the lexical token-recall checker collapses to one
gameable gate.

- proposalCandidates: require an assessable body (>= MIN_CONTENT_CHARS). A bare
  title carried no content to correctness-check, so it auto-passed the oracle
  (structurallyPresent && correct===null -> satisfied). Tool calls remain the
  only legitimately content-less deliverable.
- createTokenRecallChecker: documented polarity-blind; demoted to an opt-in
  structural pre-filter. Produced-state correctness MUST use the semantic
  checker.
- anti-game fixtures: title-only fails, thin-body fails, a token-complete
  negation is caught by the semantic checker and (the documented reason) sails
  past the lexical one.

2273 tests green.
…l recall

A requirement worded as a deliverable shape — 'Generated swap-comparison view
persisted as a ui/** artifact' — recalled only 3/7 tokens (0.429, below the 0.5
match threshold) against a correct OpenUI artifact, because 'generated', 'view',
'persisted', 'artifact' are form words the JSON body never echoes. The
discriminative tokens are the domain nouns (swap, comparison).

Strip a small deliverable-FORM stopword set from the REQUIREMENT side of token
recall so a deliverable matches on what it is ABOUT, not the boilerplate naming
its form. Mirrors TITLE_STOPWORDS in the correctness checker. Anti-game holds: an
off-topic artifact still lacks the domain nouns (pinned by a new negative test).

Verified on the blind canonical run c9115ded: the openui persona goes 0.00 ->
1.00 (16 OpenUI artifacts credited) with zero false un-blinding on the other 14
personas. 2275 tests green.
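The requirement-side strip can be sketched as follows. The generic stopword list, the tokenizer, and both artifact texts are assumptions; the form-word set contains only the words this PR names, and the real REQUIREMENT_FORM_STOPWORDS list is longer.

```typescript
// Hypothetical sketch of requirement-side form stripping; not the repo's code.
const STOPWORDS = new Set(["a", "an", "as", "the", "of", "to"]);
// Only the form words named in this PR; the real set has more entries.
const REQUIREMENT_FORM_STOPWORDS = new Set([
  "generated", "view", "persisted", "artifact", "note", "proposal", "flag",
]);
const MATCH_THRESHOLD = 0.5;

function tokenize(text: string): string[] {
  const words: string[] = text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  return Array.from(new Set(words.filter((w) => !STOPWORDS.has(w))));
}

function tokenRecall(requirement: string, candidate: string): number {
  // Form words are removed from the requirement only, never from the candidate.
  const req = tokenize(requirement).filter((t) => !REQUIREMENT_FORM_STOPWORDS.has(t));
  if (req.length === 0) return 0; // nothing distinctive left to match on
  const cand = new Set(tokenize(candidate));
  return req.filter((t) => cand.has(t)).length / req.length;
}

const requirement = "Generated swap-comparison view persisted as a ui/** artifact";
const onTopic =
  'ui/swap-comparison.json {"type":"comparison","rows":[{"swap":"Policy A to Policy B"}]}';
const offTopic = 'ui/dashboard.json {"type":"chart","title":"Monthly premiums"}';

const onTopicRecall = tokenRecall(requirement, onTopic); // swap, comparison, ui all hit
const offTopicRecall = tokenRecall(requirement, offTopic); // only ui hits
const formOnlyRecall = tokenRecall("Generated artifact", onTopic); // empty requirement set
```

With these inputs the on-topic artifact reaches a recall of 1, the off-topic one stays at 1/3 (below MATCH_THRESHOLD), and a requirement made only of form words scores 0.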

@tangletools tangletools left a comment

✅ Auto-approved PR — cfc4f3ce

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-14T02:14:05Z

@tangletools tangletools left a comment

🟢 Value Audit — sound

Verdict sound
Concerns 0 (none)
Heuristic 0.0s
Duplication 0.0s
Interrogation 30.0s (2 bridge agents)
Total 30.0s

No concerns — sound change, no better or existing approach found. ✅


What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260614T024653Z

@tangletools

⚠️ Review Incomplete — cfc4f3ce

At least one required reviewer lane failed closed. No approval or request-changes review was published. Trigger a fresh review on the current PR head.

tangletools · 2026-06-14T03:05:12Z

@tangletools tangletools left a comment

🟢 Value Audit — sound

Verdict sound
Concerns 0 (none)
Heuristic 0.0s
Duplication 0.0s
Interrogation 30.0s (2 bridge agents)
Total 30.0s

No concerns — sound change, no better or existing approach found. ✅


value-audit · 20260614T033818Z

@tangletools

✅ No Blockers — cfc4f3ce

Readiness 76/100 · Confidence 80/100 · 7 findings (1 medium, 6 low)

| | deepseek | glm | aggregate |
|---|---|---|---|
| Readiness | 76 | 86 | 76 |
| Confidence | 80 | 80 | 80 |
| Correctness | 76 | 86 | 76 |
| Security | 76 | 86 | 76 |
| Testing | 76 | 86 | 76 |
| Architecture | 76 | 86 | 76 |

Full multi-shot audit completed 4/4 planned shots over 4 changed files. Global verifier still owns final merge decision.

🟠 MEDIUM 'view' in REQUIREMENT_FORM_STOPWORDS may over-strip domain-significant tokens — src/completion-verifier.ts

Line 158: 'view' is included in the deliverable-FORM stopword set. In UI-generation contexts this is correct (e.g. 'Generated swap-comparison view' -> 'swap-comparison' are the domain tokens). However 'view' also appears as a genuine domain noun in requirements like 'Customer Lifecycle View analysis' where stripping it degrades recall. The risk is partially mitigated because stripping is requirement-side only and other domain tokens usually survive, but a requirement consisting solely of ['view', fieldName] could drop below MATCH_THRESHOLD. Consider whether 'view' should be context-gated or documented as a known trade-off.

🟡 LOW Boundary conditions for MATCH_THRESHOLD and MIN_CONTENT_CHARS untested — src/completion-verifier.test.ts

No test exercises exactly MIN_CONTENT_CHARS=50 chars (thin-body test uses 9 chars; approved proposals use ~175 chars) or exactly MATCH_THRESHOLD=0.5 recall. The >= 0.5 vs > 0.5 decision at artifactCandidates:234 determines whether edge-case items match. Not a bug — all production scenarios comfortably clear the thresholds — but a change to strict-inequality (> 0.5) could silently break matching without a test to catch it.
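A boundary check of the kind this finding asks for might look like the sketch below. The two predicates are stand-ins written for illustration; a real test would drive `verifyCompletion` with fixtures that land exactly on each threshold. The values 0.5 and 50 are the ones quoted in this finding.

```typescript
// Hypothetical stand-ins for the two gates; not the repo's functions.
const MATCH_THRESHOLD = 0.5;
const MIN_CONTENT_CHARS = 50;

// Inclusive comparison: recall exactly at the threshold matches.
const matches = (recall: number): boolean => recall >= MATCH_THRESHOLD;

// Inclusive comparison on the trimmed body length.
const hasAssessableBody = (content: string | null | undefined): boolean =>
  (content ?? "").trim().length >= MIN_CONTENT_CHARS;

// Each case: [description, actual, expected].
const boundaryCases: Array<[string, boolean, boolean]> = [
  ["recall exactly at threshold matches", matches(0.5), true],
  ["recall just below threshold does not match", matches(0.49), false],
  ["body of exactly MIN_CONTENT_CHARS counts", hasAssessableBody("x".repeat(50)), true],
  ["body one char short does not count", hasAssessableBody("x".repeat(49)), false],
];

// Names of any cases whose actual result differs from the expectation.
const failures = boundaryCases
  .filter(([, actual, expected]) => actual !== expected)
  .map(([name]) => name);
console.log(failures.length === 0 ? "all boundary cases hold" : failures.join("; "));
```

If either comparison were switched to a strict inequality, the first or third case would show up in `failures`.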

🟡 LOW Non-discriminative token 'as' survives in form-vocabulary test requirement set — src/completion-verifier.test.ts

In the test at lines 259-288, the requirement 'Generated swap-comparison view persisted as a ui/** artifact' tokenizes to {swap, comparison, as, ui} after form-stopword stripping. The token 'as' (length 2, not in STOPWORDS or REQUIREMENT_FORM_STOPWORDS) is non-discriminative noise. It does not cause a test failure because the anti-game test at lines 290-316 confirms the off-topic artifact still fails recall. This is a pre-existing characteristic of the token function, not introduced by this PR, but the test fixture would be…

🟡 LOW Requirements titled entirely with form vocabulary silently match nothing — src/completion-verifier.ts

Lines 195-201: tokenRecall strips REQUIREMENT_FORM_STOPWORDS from the requirement side, then returns 0 when req.size === 0. A requirement title like 'Create a file' or 'Generated output' would have ALL tokens stripped, leaving an empty set → score 0 → no candidate can ever match. This is an inherent design trade-off (such requirements are poorly specified; they should name domain content, not form), and the motivating real-world case ('Generated swap-comparison view persisted as a ui/** artifact') still has 'swap' and 'comparison' as distinctive tokens. But a consumer writing a thin requirement title could see a silent zero with no diagnostic. Consider: when r…

🟡 LOW proposalCandidates body guard uses post-trim length for skip but pre-trim for matching — src/completion-verifier.ts

Lines 261-262: The guard if (body.length < MIN_CONTENT_CHARS) continue uses the trimmed body length, which is correct for the gate. On line 271, ${p.title} ${body} also uses the trimmed body, so tokenRecall sees the trimmed body. This is consistent but worth noting: a content string consisting only of whitespace (for example, 50+ spaces) would be trimmed to empty and skipped by the guard, so an empty body never reaches the match text. Since the guard fires first, this is safe in practice, but the control flow would be marginally clearer if the early-return were `if ((p…`

🟡 LOW Mixed-stream E2E test does not exercise the content path — src/produced-state.test.ts

The 'normalizes a realistic mixed stream end to end' test (line 121-144) includes a proposal_created event without content, so the new content-threading branch is not covered in the integration-style test. The two new unit tests cover the contract directly, so this is coverage completeness only, not a correctness gap.

🟡 LOW content null-leak via structural typing — src/produced-state.ts

Line 107: p.content !== undefined passes null through to output { content: null }, mismatching ProducedProposal.content?: string. The RuntimeEventLike catch-all { type: string } permits structurally-typed input with content: null. Downstream proposalCandidates (completion-verifier.ts:261) mitigates via (p.content ?? '').trim(), so no crash or incorrect verdict. Consider p.content != null or extracting through a content?: string type-guard. Consistent with the rest of the file's structural typing (artifact path uses ?? ''), so low-priority.
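The type-guard option mentioned in this finding could look like the sketch below. The `LooseEvent` shape and the function name are hypothetical; the point is only that `typeof ... === "string"` rejects both `undefined` and `null`, whereas `!== undefined` lets `null` through.

```typescript
// Hypothetical sketch of a string type-guard for content; not the repo's code.
interface ProducedProposal {
  title: string;
  content?: string;
}

// Loosely typed input, standing in for a structurally typed runtime event.
interface LooseEvent {
  type: string;
  title?: unknown;
  content?: unknown;
}

function threadContent(e: LooseEvent): ProducedProposal {
  const title = typeof e.title === "string" ? e.title : "";
  // Only a real string is threaded; null, undefined and other types are dropped.
  return typeof e.content === "string" ? { title, content: e.content } : { title };
}

const fromNull = threadContent({ type: "proposal_created", title: "Orphan flag", content: null });
const fromString = threadContent({
  type: "proposal_created",
  title: "Orphan flag",
  content: "Policy has no servicing agent on record.",
});
```

With this guard `fromNull` has no `content` key, so the output always satisfies `content?: string` without relying on the downstream `?? ''` fallback.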


tangletools · 2026-06-14T03:49:38Z · trace

@drewstone drewstone merged commit a10c25e into main Jun 14, 2026
1 check passed
drewstone added a commit that referenced this pull request Jun 14, 2026
…e map

Releases the merged-but-unpublished work:
- content-aware + ungameable produced-state completion grader (#248): require
  a body for a structural proposal match, strip deliverable-FORM vocabulary
  from requirement-side recall, anti-game fixtures.
- eval surface map doc (#249): pick-by-use-when for run* + the in-band
  produced-state contract; retires the phantom persona-dispatch wrapper.

Stricter verifyCompletion behavior — a notable minor; consumers adopt by bump.
drewstone added a commit that referenced this pull request Jun 14, 2026
…e map (#250)
