docs(eval): surface map + retire the phantom persona-dispatch wrapper#249
Conversation
…na-dispatch wrapper The eval run* primitives read as a confusing pile to a new developer. Add a single pick-by-'use-when' table (runCampaign / runEval / runProfileMatrix / runOptimization / runImprovementLoop / runEvalCampaign / runAgentMatrix), each with its distinct role, so the surface stops looking interchangeable. Document the produced-state grading composition: the composition point is a judge wrapping verifyCompletion(extractProducedState(events)) — so runProducedStatePersonaDispatch does NOT exist and should not be built. State the in-band body contract (artifact + proposal_created both carry content). Cross-linked from concepts.md. Docs only.
tangletools
left a comment
There was a problem hiding this comment.
✅ Auto-approved PR — 640bc017
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-14T11:21:57Z
tangletools
left a comment
There was a problem hiding this comment.
🟡 Value Audit — sound-with-nits
| Verdict | sound-with-nits |
| Concerns | 3 (3 weak-concern) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 120.0s (2 bridge agents) |
| Total | 120.0s |
💰 Value — sound-with-nits
Adds a concise, cross-linked eval-primitive selector map and an anti-wrapper guidance doc for produced-state grading — a coherent docs-only adoption fix, slightly marred by imprecise return-type shorthand.
- What it does: Creates
docs/eval-surface-map.mdand adds one cross-link indocs/concepts.md:186. The new doc provides a pick-by-"use-when" table for the sevenrun*primitives (runCampaign,runEval,runProfileMatrix,runOptimization,runImprovementLoop,runEvalCampaign,runAgentMatrix), explains the produced-state grading composition (extractProducedState→verifyCompletionwrapped as a ju - Goals it achieves: Reduce adoption confusion where the eval primitives look interchangeable, prevent consumers from inventing redundant wrappers, and steer produced-state grading to the existing judge-composition point rather than a new runner. It also locks in the rule that deliverable bodies should travel in-band on events instead of being re-fetched from product databases.
- Assessment: Good on its merits. The repo already uses standalone "map" docs (
docs/self-improvement-map.md,docs/feature-guide.md) to disambiguate confusing surfaces, so a neweval-surface-map.mdfits the grain. The produced-state guidance is grounded in the real substrate adapter atsrc/campaign/presets/playback.ts:67-103, which composesextractProducedState+verifyCompletioninto a `runProfileMa - Better / existing approach: none — this is the right approach. A separate map doc is consistent with
self-improvement-map.md, and neitherdocs/feature-guide.mdnordocs/design/primitives-integration-spec.mdalready provides this primitive-by-primitive, use-when table or the produced-state wrapper warning. The produced-state guidance could in theory live inprimitives-integration-spec.md, but the anti-wrapper narrativ
🎯 Usefulness — sound-with-nits
A docs-only clarification that is well-integrated via concepts.md/README.md and complements existing guides, with two minor accuracy nits on return-type shorthand and the proposal content contract.
- Integration: The new doc is reachable: README.md:167 sends readers to docs/concepts.md, and concepts.md:186 now links to docs/eval-surface-map.md. It is not dead surface; it sits on the main docs path an adopter follows after install.
- Fit with existing patterns: It fits the existing code and doc conventions. The mental model it documents (measure → factor → generate → gate) matches the implementation files: runCampaign/runEval in src/campaign/*, runProfileMatrix in src/campaign/presets/run-profile-matrix.ts, runOptimization/runImprovementLoop in src/campaign/presets/run-optimization.ts and run-improvement-loop.ts, runEvalCampaign in src/eval-campaign.ts,
- Real-world viability: As documentation it has no runtime path to stress, and it correctly reflects the underlying primitives' real-backend integrity guard (runProfileMatrix.ts:17-29), cost ceiling (runAgentMatrix runner.ts:89), and holdout gate (run-improvement-loop.ts:97-111). One contract claim is slightly ahead of the code: the doc states proposal_created events carry content (eval-surface-map.md:51-53), but src/pro
🎯 Usefulness Audit
🟡 Return-type shorthand in the table oversimplifies two primitives [ergonomics] ``
eval-surface-map.md:14 says runProfileMatrix returns
RunRecord[], but the function returnsRunProfileMatrixResult(src/campaign/presets/run-profile-matrix.ts:170-185) which containsrecords: RunRecord[]plus byProfile/byScenario/integrity/campaigns. eval-surface-map.md:17 says runEvalCampaign returnsCampaignResult + records, but it returnsEvalCampaignResult(src/eval-campaign.ts:274-288) withruns: RunRecord[], integrityReports, failedRuns, etc., and noCampaignResult. Suggested
🟡 Proposal content contract is documented but not yet implemented in produced-state extraction [problem-fit] ``
eval-surface-map.md:51-53 asserts that
proposal_createdevents carrycontent, but src/produced-state.ts:41-46ProposalEventLikehas nocontentfield, and lines 97-99 map the event to{ id, title, status }without copying content. The downstreamProducedProposal.contentfield exists (src/completion-verifier.ts:46-52) andproposalCandidates(completion-verifier.ts:211-229) will grade it when present, so the grader is ready but the extraction layer is not. Either addcontentto `Pro
💰 Value Audit
🟡 Return-type shorthand in the table is imprecise [maintenance] ``
The table says
runProfileMatrixreturnsRunRecord[], but the actual return isRunProfileMatrixResultcontainingrecords,campaigns,integrity, etc. (src/campaign/presets/run-profile-matrix.ts:170-185). It also saysrunEvalCampaignreturnsCampaignResult+ records, but the actual return isEvalCampaignResultwithruns,integrityReports,report, etc. (src/eval-campaign.ts:274-288). Better to list the real result types or note "including ..." to avoid misleading adopters
What this audit checks
It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.
| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |
Findings are concerns, not blocks — the human reviewer decides what to do with them.
✅ No Blockers —
|
| deepseek | glm | aggregate | |
|---|---|---|---|
| Readiness | 72 | 79 | 72 |
| Confidence | 70 | 70 | 70 |
| Correctness | 72 | 79 | 72 |
| Security | 72 | 79 | 72 |
| Testing | 72 | 79 | 72 |
| Architecture | 72 | 79 | 72 |
Full multi-shot audit completed 2/2 planned shots over 2 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 2/2 planned shots over 2 changed files. Global verifier still owns final merge decision.
🟠 MEDIUM Doc claims proposal events carry content, but extraction code never captures it — docs/eval-surface-map.md
Line 52 states: 'proposal_created events carry content (the submit_proposal description) — same role, same field name.' This is factually wrong against the current code. ProposalEventLike (src/produced-state.ts:41-46) has no
contentfield — only proposalId, title, status. extractProducedState (produced-state.ts:99) maps proposals as{ id: p.proposalId, title: p.title, status: p.status ?? 'pending' }with no content. ProducedProposal.content exists as optional in completion-verifier.ts:51 but is NEVER populated by the extraction. Impact: an integrator following this doc will expect proposals to flow through verifyCompletion's correctness check (completion-verifier.t
🟠 MEDIUM runEvalCampaign returns EvalCampaignResult, not CampaignResult — docs/eval-surface-map.md
Line 17 table cell: 'CampaignResult + records'. Actual return is EvalCampaignResult (src/eval-campaign.ts:274) — a separate interface with no shared fields with CampaignResult: runs: RunRecord[], integrityReports, failedRuns, report, campaignFingerprint, preregistrationHash. EvalCampaignResult does not extend CampaignResult and has no manifestHash, seed, cells, aggregates, optimization, or gate fields. Fix: s/CampaignResult + records/EvalCampaignResult/.
🟠 MEDIUM runProfileMatrix return type is RunProfileMatrixResult, not RunRecord[] — docs/eval-surface-map.md
Line 14 table cell: 'RunRecord[]'. Actual return is Promise (src/campaign/presets/run-profile-matrix.ts:170). RunProfileMatrixResult wraps RunRecord[] under .records but also carries matrixId, experimentId, byProfile, byScenario, byPersona, integrity, and campaigns. A consumer following this doc would access the return as a bare array and miss structural metadata. Fix: s/RunRecord[]/RunProfileMatrixResult/.
🟡 LOW Nav bullet is markedly denser than its siblings — docs/concepts.md
Line 186 is ~3x longer than the surrounding 'Where to go next' bullets and embeds three jargon phrases ('verifyCompletion-as-judge', 'persona-dispatch wrapper', 'in-band body contract'). The other bullets (self-improvement-map, feature-guide, wire-protocol) each lead with a one-line plain-language hook. The density is defensible — it signposts the 'do not build a wrapper' guardrail at the nav layer — but a reader scanning the list for a quick orientation may stumble. No correctness impact; consider trimming to a hook + deferred detail if this list is meant for skim readers.
🟡 LOW runEvalCampaign return type described as CampaignResult but returns EvalCampaignResult — docs/eval-surface-map.md
Line 17 says runEvalCampaign returns 'CampaignResult + records', using backtick-wrapped CampaignResult (a concrete type). The actual return type is EvalCampaignResult (src/eval-campaign.ts:303-305), which (eval-campaign.ts:274-288) contains runs: RunRecord[], integrityReports: RunIntegrityReport[], failedRuns: FailedRun[], and optional report: ResearchReport. There is no CampaignResult field anywhere in EvalCampaignResult. Fix: change the table cell to 'EvalCampaignResult' or 'RunRecord[] + integrity reports' to match the actual type.
🟡 LOW verifyCompletion parameter names differ from source — docs/eval-surface-map.md
Line 36 pipeline diagram shows verifyCompletion(taskGold, state, correctnessChecker). Actual signature (src/completion-verifier.ts:259) uses gold (not taskGold) and checkCorrectness (not correctnessChecker). Cosmetic — no semantic difference — but the doc being wrong about the parameter names undermines its authority when a reader copy-pastes from it.
tangletools · 2026-06-14T11:37:47Z · trace
…e map Releases the merged-but-unpublished work: - content-aware + ungameable produced-state completion grader (#248): require a body for a structural proposal match, strip deliverable-FORM vocabulary from requirement-side recall, anti-game fixtures. - eval surface map doc (#249): pick-by-use-when for run* + the in-band produced-state contract; retires the phantom persona-dispatch wrapper. Stricter verifyCompletion behavior — a notable minor; consumers adopt by bump.
…e map (#250) Releases the merged-but-unpublished work: - content-aware + ungameable produced-state completion grader (#248): require a body for a structural proposal match, strip deliverable-FORM vocabulary from requirement-side recall, anti-game fixtures. - eval surface map doc (#249): pick-by-use-when for run* + the in-band produced-state contract; retires the phantom persona-dispatch wrapper. Stricter verifyCompletion behavior — a notable minor; consumers adopt by bump.
Why
The
run*eval primitives read as a confusing, interchangeable pile to a developer adopting the substrate — and that confusion is what spawns redundant wrappers (the proposedrunProducedStatePersonaDispatchbeing the latest). This makes the existing surface legible so no one reaches for a new wrapper.What
docs/eval-surface-map.md(cross-linked fromconcepts.md):runCampaign/runEval/runProfileMatrix/runOptimization/runImprovementLoop/runEvalCampaign/runAgentMatrix— each with its distinct, non-overlapping role. Mental model: measure → factor → generate → gate.runtime events → extractProducedState → verifyCompletion as a JudgeConfig. States plainly thatrunProducedStatePersonaDispatchdoes not exist and should not be built — the judge IS the composition point.artifactandproposal_createdboth carrycontent(see agent-runtime#292, agent-app#55); a consumer re-fetching a body from its own DB is working around a thin event, not missing a wrapper.Docs only.