docs(eval): surface map + retire the phantom persona-dispatch wrapper by drewstone · Pull Request #249 · tangle-network/agent-eval

drewstone · 2026-06-14T11:21:51Z

Why

The run* eval primitives read as a confusing, interchangeable pile to a developer adopting the substrate — and that confusion is what spawns redundant wrappers (the proposed runProducedStatePersonaDispatch being the latest). This makes the existing surface legible so no one reaches for a new wrapper.

What

docs/eval-surface-map.md (cross-linked from concepts.md):

A pick-by-use-when table for runCampaign / runEval / runProfileMatrix / runOptimization / runImprovementLoop / runEvalCampaign / runAgentMatrix — each with its distinct, non-overlapping role. Mental model: measure → factor → generate → gate.
The produced-state grading composition: runtime events → extractProducedState → verifyCompletion as a JudgeConfig. States plainly that runProducedStatePersonaDispatch does not exist and should not be built — the judge IS the composition point.
The in-band body contract: artifact and proposal_created both carry content (see agent-runtime#292, agent-app#55); a consumer re-fetching a body from its own DB is working around a thin event, not missing a wrapper.

Docs only.

…na-dispatch wrapper The eval run* primitives read as a confusing pile to a new developer. Add a single pick-by-'use-when' table (runCampaign / runEval / runProfileMatrix / runOptimization / runImprovementLoop / runEvalCampaign / runAgentMatrix), each with its distinct role, so the surface stops looking interchangeable. Document the produced-state grading composition: the composition point is a judge wrapping verifyCompletion(extractProducedState(events)) — so runProducedStatePersonaDispatch does NOT exist and should not be built. State the in-band body contract (artifact + proposal_created both carry content). Cross-linked from concepts.md. Docs only.

tangletools

✅ Auto-approved PR — `640bc017`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-14T11:21:57Z}

tangletools

🟡 Value Audit — sound-with-nits


Verdict	sound-with-nits
Concerns	3 (3 weak-concern)
Heuristic	0.0s
Duplication	0.0s
Interrogation	120.0s (2 bridge agents)
Total	120.0s

💰 Value — sound-with-nits

Adds a concise, cross-linked eval-primitive selector map and an anti-wrapper guidance doc for produced-state grading — a coherent docs-only adoption fix, slightly marred by imprecise return-type shorthand.

What it does: Creates docs/eval-surface-map.md and adds one cross-link in docs/concepts.md:186. The new doc provides a pick-by-"use-when" table for the seven run* primitives (runCampaign, runEval, runProfileMatrix, runOptimization, runImprovementLoop, runEvalCampaign, runAgentMatrix), explains the produced-state grading composition (extractProducedState → verifyCompletion wrapped as a ju
Goals it achieves: Reduce adoption confusion where the eval primitives look interchangeable, prevent consumers from inventing redundant wrappers, and steer produced-state grading to the existing judge-composition point rather than a new runner. It also locks in the rule that deliverable bodies should travel in-band on events instead of being re-fetched from product databases.
Assessment: Good on its merits. The repo already uses standalone "map" docs (docs/self-improvement-map.md, docs/feature-guide.md) to disambiguate confusing surfaces, so a new eval-surface-map.md fits the grain. The produced-state guidance is grounded in the real substrate adapter at src/campaign/presets/playback.ts:67-103, which composes extractProducedState + verifyCompletion into a `runProfileMa
Better / existing approach: none — this is the right approach. A separate map doc is consistent with self-improvement-map.md, and neither docs/feature-guide.md nor docs/design/primitives-integration-spec.md already provides this primitive-by-primitive, use-when table or the produced-state wrapper warning. The produced-state guidance could in theory live in primitives-integration-spec.md, but the anti-wrapper narrativ

🎯 Usefulness — sound-with-nits

A docs-only clarification that is well-integrated via concepts.md/README.md and complements existing guides, with two minor accuracy nits on return-type shorthand and the proposal content contract.

Integration: The new doc is reachable: README.md:167 sends readers to docs/concepts.md, and concepts.md:186 now links to docs/eval-surface-map.md. It is not dead surface; it sits on the main docs path an adopter follows after install.
Fit with existing patterns: It fits the existing code and doc conventions. The mental model it documents (measure → factor → generate → gate) matches the implementation files: runCampaign/runEval in src/campaign/*, runProfileMatrix in src/campaign/presets/run-profile-matrix.ts, runOptimization/runImprovementLoop in src/campaign/presets/run-optimization.ts and run-improvement-loop.ts, runEvalCampaign in src/eval-campaign.ts,
Real-world viability: As documentation it has no runtime path to stress, and it correctly reflects the underlying primitives' real-backend integrity guard (runProfileMatrix.ts:17-29), cost ceiling (runAgentMatrix runner.ts:89), and holdout gate (run-improvement-loop.ts:97-111). One contract claim is slightly ahead of the code: the doc states proposal_created events carry content (eval-surface-map.md:51-53), but src/pro

🎯 Usefulness Audit

🟡 Return-type shorthand in the table oversimplifies two primitives [ergonomics] ``

eval-surface-map.md:14 says runProfileMatrix returns RunRecord[], but the function returns RunProfileMatrixResult (src/campaign/presets/run-profile-matrix.ts:170-185) which contains records: RunRecord[] plus byProfile/byScenario/integrity/campaigns. eval-surface-map.md:17 says runEvalCampaign returns CampaignResult + records, but it returns EvalCampaignResult (src/eval-campaign.ts:274-288) with runs: RunRecord[], integrityReports, failedRuns, etc., and no CampaignResult. Suggested

🟡 Proposal content contract is documented but not yet implemented in produced-state extraction [problem-fit] ``

eval-surface-map.md:51-53 asserts that proposal_created events carry content, but src/produced-state.ts:41-46 ProposalEventLike has no content field, and lines 97-99 map the event to { id, title, status } without copying content. The downstream ProducedProposal.content field exists (src/completion-verifier.ts:46-52) and proposalCandidates (completion-verifier.ts:211-229) will grade it when present, so the grader is ready but the extraction layer is not. Either add content to `Pro

💰 Value Audit

🟡 Return-type shorthand in the table is imprecise [maintenance] ``

The table says runProfileMatrix returns RunRecord[], but the actual return is RunProfileMatrixResult containing records, campaigns, integrity, etc. (src/campaign/presets/run-profile-matrix.ts:170-185). It also says runEvalCampaign returns CampaignResult + records, but the actual return is EvalCampaignResult with runs, integrityReports, report, etc. (src/eval-campaign.ts:274-288). Better to list the real result types or note "including ..." to avoid misleading adopters

What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

Pass	What it asks
Heuristic	Vague title? Whitespace-only or cruft-bearing diff? (content signals only)
Duplication	Do added function/class names already exist elsewhere in the repo?
Value Audit	What does it do? What goal does it achieve? Is it good? Better architecture or already-exists?
Usefulness Audit	Does it integrate and fit? Will it hold up in real use and actually get used?

Findings are concerns, not blocks — the human reviewer decides what to do with them.

_{value-audit · 20260614T113429Z}

tangletools · 2026-06-14T11:37:49Z

✅ No Blockers — `640bc017`

Readiness 72/100 · Confidence 70/100 · 6 findings (3 medium, 3 low)

	deepseek	glm	aggregate
Readiness	72	79	72
Confidence	70	70	70
Correctness	72	79	72
Security	72	79	72
Testing	72	79	72
Architecture	72	79	72

Full multi-shot audit completed 2/2 planned shots over 2 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 2/2 planned shots over 2 changed files. Global verifier still owns final merge decision.

🟠 MEDIUM Doc claims proposal events carry content, but extraction code never captures it — docs/eval-surface-map.md

Line 52 states: 'proposal_created events carry content (the submit_proposal description) — same role, same field name.' This is factually wrong against the current code. ProposalEventLike (src/produced-state.ts:41-46) has no content field — only proposalId, title, status. extractProducedState (produced-state.ts:99) maps proposals as { id: p.proposalId, title: p.title, status: p.status ?? 'pending' } with no content. ProducedProposal.content exists as optional in completion-verifier.ts:51 but is NEVER populated by the extraction. Impact: an integrator following this doc will expect proposals to flow through verifyCompletion's correctness check (completion-verifier.t

🟠 MEDIUM runEvalCampaign returns EvalCampaignResult, not CampaignResult — docs/eval-surface-map.md

Line 17 table cell: 'CampaignResult + records'. Actual return is EvalCampaignResult (src/eval-campaign.ts:274) — a separate interface with no shared fields with CampaignResult: runs: RunRecord[], integrityReports, failedRuns, report, campaignFingerprint, preregistrationHash. EvalCampaignResult does not extend CampaignResult and has no manifestHash, seed, cells, aggregates, optimization, or gate fields. Fix: s/CampaignResult + records/EvalCampaignResult/.

🟠 MEDIUM runProfileMatrix return type is RunProfileMatrixResult, not RunRecord[] — docs/eval-surface-map.md

Line 14 table cell: 'RunRecord[]'. Actual return is Promise (src/campaign/presets/run-profile-matrix.ts:170). RunProfileMatrixResult wraps RunRecord[] under .records but also carries matrixId, experimentId, byProfile, byScenario, byPersona, integrity, and campaigns. A consumer following this doc would access the return as a bare array and miss structural metadata. Fix: s/RunRecord[]/RunProfileMatrixResult/.

🟡 LOW Nav bullet is markedly denser than its siblings — docs/concepts.md

Line 186 is ~3x longer than the surrounding 'Where to go next' bullets and embeds three jargon phrases ('verifyCompletion-as-judge', 'persona-dispatch wrapper', 'in-band body contract'). The other bullets (self-improvement-map, feature-guide, wire-protocol) each lead with a one-line plain-language hook. The density is defensible — it signposts the 'do not build a wrapper' guardrail at the nav layer — but a reader scanning the list for a quick orientation may stumble. No correctness impact; consider trimming to a hook + deferred detail if this list is meant for skim readers.

🟡 LOW runEvalCampaign return type described as CampaignResult but returns EvalCampaignResult — docs/eval-surface-map.md

Line 17 says runEvalCampaign returns 'CampaignResult + records', using backtick-wrapped CampaignResult (a concrete type). The actual return type is EvalCampaignResult (src/eval-campaign.ts:303-305), which (eval-campaign.ts:274-288) contains runs: RunRecord[], integrityReports: RunIntegrityReport[], failedRuns: FailedRun[], and optional report: ResearchReport. There is no CampaignResult field anywhere in EvalCampaignResult. Fix: change the table cell to 'EvalCampaignResult' or 'RunRecord[] + integrity reports' to match the actual type.

🟡 LOW verifyCompletion parameter names differ from source — docs/eval-surface-map.md

Line 36 pipeline diagram shows verifyCompletion(taskGold, state, correctnessChecker). Actual signature (src/completion-verifier.ts:259) uses gold (not taskGold) and checkCorrectness (not correctnessChecker). Cosmetic — no semantic difference — but the doc being wrong about the parameter names undermines its authority when a reader copy-pastes from it.

_{tangletools · 2026-06-14T11:37:47Z · trace}

…e map Releases the merged-but-unpublished work: - content-aware + ungameable produced-state completion grader (#248): require a body for a structural proposal match, strip deliverable-FORM vocabulary from requirement-side recall, anti-game fixtures. - eval surface map doc (#249): pick-by-use-when for run* + the in-band produced-state contract; retires the phantom persona-dispatch wrapper. Stricter verifyCompletion behavior — a notable minor; consumers adopt by bump.

…e map (#250) Releases the merged-but-unpublished work: - content-aware + ungameable produced-state completion grader (#248): require a body for a structural proposal match, strip deliverable-FORM vocabulary from requirement-side recall, anti-game fixtures. - eval surface map doc (#249): pick-by-use-when for run* + the in-band produced-state contract; retires the phantom persona-dispatch wrapper. Stricter verifyCompletion behavior — a notable minor; consumers adopt by bump.

tangletools approved these changes Jun 14, 2026

View reviewed changes

tangletools reviewed Jun 14, 2026

View reviewed changes

drewstone merged commit 509c9fb into main Jun 14, 2026
1 check passed

drewstone mentioned this pull request Jun 14, 2026

chore(release): agent-eval 0.92.0 — faithful grader + surface map #250

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docs(eval): surface map + retire the phantom persona-dispatch wrapper#249

docs(eval): surface map + retire the phantom persona-dispatch wrapper#249
drewstone merged 1 commit into
mainfrom
chore/persona-eval-surface-map

drewstone commented Jun 14, 2026

Uh oh!

tangletools left a comment

Uh oh!

tangletools left a comment

Uh oh!

tangletools commented Jun 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

drewstone commented Jun 14, 2026

Why

What

Uh oh!

tangletools left a comment

Choose a reason for hiding this comment

✅ Auto-approved PR — 640bc017

Uh oh!

tangletools left a comment

Choose a reason for hiding this comment

🟡 Value Audit — sound-with-nits

💰 Value — sound-with-nits

🎯 Usefulness — sound-with-nits

🎯 Usefulness Audit

💰 Value Audit

Uh oh!

tangletools commented Jun 14, 2026

✅ No Blockers — 640bc017

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

✅ Auto-approved PR — `640bc017`

✅ No Blockers — `640bc017`