Skip to content

docs(eval): surface map + retire the phantom persona-dispatch wrapper#249

Merged
drewstone merged 1 commit into
mainfrom
chore/persona-eval-surface-map
Jun 14, 2026
Merged

docs(eval): surface map + retire the phantom persona-dispatch wrapper#249
drewstone merged 1 commit into
mainfrom
chore/persona-eval-surface-map

Conversation

@drewstone

Copy link
Copy Markdown
Contributor

Why

The run* eval primitives read as a confusing, interchangeable pile to a developer adopting the substrate — and that confusion is what spawns redundant wrappers (the proposed runProducedStatePersonaDispatch being the latest). This makes the existing surface legible so no one reaches for a new wrapper.

What

docs/eval-surface-map.md (cross-linked from concepts.md):

  • A pick-by-use-when table for runCampaign / runEval / runProfileMatrix / runOptimization / runImprovementLoop / runEvalCampaign / runAgentMatrix — each with its distinct, non-overlapping role. Mental model: measure → factor → generate → gate.
  • The produced-state grading composition: runtime events → extractProducedState → verifyCompletion as a JudgeConfig. States plainly that runProducedStatePersonaDispatch does not exist and should not be built — the judge IS the composition point.
  • The in-band body contract: artifact and proposal_created both carry content (see agent-runtime#292, agent-app#55); a consumer re-fetching a body from its own DB is working around a thin event, not missing a wrapper.

Docs only.

…na-dispatch wrapper

The eval run* primitives read as a confusing pile to a new developer. Add a
single pick-by-'use-when' table (runCampaign / runEval / runProfileMatrix /
runOptimization / runImprovementLoop / runEvalCampaign / runAgentMatrix), each
with its distinct role, so the surface stops looking interchangeable.

Document the produced-state grading composition: the composition point is a
judge wrapping verifyCompletion(extractProducedState(events)) — so
runProducedStatePersonaDispatch does NOT exist and should not be built. State
the in-band body contract (artifact + proposal_created both carry content).

Cross-linked from concepts.md. Docs only.

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Auto-approved PR — 640bc017

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-14T11:21:57Z

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Value Audit — sound-with-nits

Verdict sound-with-nits
Concerns 3 (3 weak-concern)
Heuristic 0.0s
Duplication 0.0s
Interrogation 120.0s (2 bridge agents)
Total 120.0s

💰 Value — sound-with-nits

Adds a concise, cross-linked eval-primitive selector map and an anti-wrapper guidance doc for produced-state grading — a coherent docs-only adoption fix, slightly marred by imprecise return-type shorthand.

  • What it does: Creates docs/eval-surface-map.md and adds one cross-link in docs/concepts.md:186. The new doc provides a pick-by-"use-when" table for the seven run* primitives (runCampaign, runEval, runProfileMatrix, runOptimization, runImprovementLoop, runEvalCampaign, runAgentMatrix), explains the produced-state grading composition (extractProducedStateverifyCompletion wrapped as a ju
  • Goals it achieves: Reduce adoption confusion where the eval primitives look interchangeable, prevent consumers from inventing redundant wrappers, and steer produced-state grading to the existing judge-composition point rather than a new runner. It also locks in the rule that deliverable bodies should travel in-band on events instead of being re-fetched from product databases.
  • Assessment: Good on its merits. The repo already uses standalone "map" docs (docs/self-improvement-map.md, docs/feature-guide.md) to disambiguate confusing surfaces, so a new eval-surface-map.md fits the grain. The produced-state guidance is grounded in the real substrate adapter at src/campaign/presets/playback.ts:67-103, which composes extractProducedState + verifyCompletion into a `runProfileMa
  • Better / existing approach: none — this is the right approach. A separate map doc is consistent with self-improvement-map.md, and neither docs/feature-guide.md nor docs/design/primitives-integration-spec.md already provides this primitive-by-primitive, use-when table or the produced-state wrapper warning. The produced-state guidance could in theory live in primitives-integration-spec.md, but the anti-wrapper narrativ

🎯 Usefulness — sound-with-nits

A docs-only clarification that is well-integrated via concepts.md/README.md and complements existing guides, with two minor accuracy nits on return-type shorthand and the proposal content contract.

  • Integration: The new doc is reachable: README.md:167 sends readers to docs/concepts.md, and concepts.md:186 now links to docs/eval-surface-map.md. It is not dead surface; it sits on the main docs path an adopter follows after install.
  • Fit with existing patterns: It fits the existing code and doc conventions. The mental model it documents (measure → factor → generate → gate) matches the implementation files: runCampaign/runEval in src/campaign/*, runProfileMatrix in src/campaign/presets/run-profile-matrix.ts, runOptimization/runImprovementLoop in src/campaign/presets/run-optimization.ts and run-improvement-loop.ts, runEvalCampaign in src/eval-campaign.ts,
  • Real-world viability: As documentation it has no runtime path to stress, and it correctly reflects the underlying primitives' real-backend integrity guard (runProfileMatrix.ts:17-29), cost ceiling (runAgentMatrix runner.ts:89), and holdout gate (run-improvement-loop.ts:97-111). One contract claim is slightly ahead of the code: the doc states proposal_created events carry content (eval-surface-map.md:51-53), but src/pro

🎯 Usefulness Audit

🟡 Return-type shorthand in the table oversimplifies two primitives [ergonomics] ``

eval-surface-map.md:14 says runProfileMatrix returns RunRecord[], but the function returns RunProfileMatrixResult (src/campaign/presets/run-profile-matrix.ts:170-185) which contains records: RunRecord[] plus byProfile/byScenario/integrity/campaigns. eval-surface-map.md:17 says runEvalCampaign returns CampaignResult + records, but it returns EvalCampaignResult (src/eval-campaign.ts:274-288) with runs: RunRecord[], integrityReports, failedRuns, etc., and no CampaignResult. Suggested

🟡 Proposal content contract is documented but not yet implemented in produced-state extraction [problem-fit] ``

eval-surface-map.md:51-53 asserts that proposal_created events carry content, but src/produced-state.ts:41-46 ProposalEventLike has no content field, and lines 97-99 map the event to { id, title, status } without copying content. The downstream ProducedProposal.content field exists (src/completion-verifier.ts:46-52) and proposalCandidates (completion-verifier.ts:211-229) will grade it when present, so the grader is ready but the extraction layer is not. Either add content to `Pro

💰 Value Audit

🟡 Return-type shorthand in the table is imprecise [maintenance] ``

The table says runProfileMatrix returns RunRecord[], but the actual return is RunProfileMatrixResult containing records, campaigns, integrity, etc. (src/campaign/presets/run-profile-matrix.ts:170-185). It also says runEvalCampaign returns CampaignResult + records, but the actual return is EvalCampaignResult with runs, integrityReports, report, etc. (src/eval-campaign.ts:274-288). Better to list the real result types or note "including ..." to avoid misleading adopters


What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

Pass What it asks
Heuristic Vague title? Whitespace-only or cruft-bearing diff? (content signals only)
Duplication Do added function/class names already exist elsewhere in the repo?
Value Audit What does it do? What goal does it achieve? Is it good? Better architecture or already-exists?
Usefulness Audit Does it integrate and fit? Will it hold up in real use and actually get used?

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260614T113429Z

@tangletools

Copy link
Copy Markdown
Contributor

✅ No Blockers — 640bc017

Readiness 72/100 · Confidence 70/100 · 6 findings (3 medium, 3 low)

deepseek glm aggregate
Readiness 72 79 72
Confidence 70 70 70
Correctness 72 79 72
Security 72 79 72
Testing 72 79 72
Architecture 72 79 72

Full multi-shot audit completed 2/2 planned shots over 2 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 2/2 planned shots over 2 changed files. Global verifier still owns final merge decision.

🟠 MEDIUM Doc claims proposal events carry content, but extraction code never captures it — docs/eval-surface-map.md

Line 52 states: 'proposal_created events carry content (the submit_proposal description) — same role, same field name.' This is factually wrong against the current code. ProposalEventLike (src/produced-state.ts:41-46) has no content field — only proposalId, title, status. extractProducedState (produced-state.ts:99) maps proposals as { id: p.proposalId, title: p.title, status: p.status ?? 'pending' } with no content. ProducedProposal.content exists as optional in completion-verifier.ts:51 but is NEVER populated by the extraction. Impact: an integrator following this doc will expect proposals to flow through verifyCompletion's correctness check (completion-verifier.t

🟠 MEDIUM runEvalCampaign returns EvalCampaignResult, not CampaignResult — docs/eval-surface-map.md

Line 17 table cell: 'CampaignResult + records'. Actual return is EvalCampaignResult (src/eval-campaign.ts:274) — a separate interface with no shared fields with CampaignResult: runs: RunRecord[], integrityReports, failedRuns, report, campaignFingerprint, preregistrationHash. EvalCampaignResult does not extend CampaignResult and has no manifestHash, seed, cells, aggregates, optimization, or gate fields. Fix: s/CampaignResult + records/EvalCampaignResult/.

🟠 MEDIUM runProfileMatrix return type is RunProfileMatrixResult, not RunRecord[] — docs/eval-surface-map.md

Line 14 table cell: 'RunRecord[]'. Actual return is Promise (src/campaign/presets/run-profile-matrix.ts:170). RunProfileMatrixResult wraps RunRecord[] under .records but also carries matrixId, experimentId, byProfile, byScenario, byPersona, integrity, and campaigns. A consumer following this doc would access the return as a bare array and miss structural metadata. Fix: s/RunRecord[]/RunProfileMatrixResult/.

🟡 LOW Nav bullet is markedly denser than its siblings — docs/concepts.md

Line 186 is ~3x longer than the surrounding 'Where to go next' bullets and embeds three jargon phrases ('verifyCompletion-as-judge', 'persona-dispatch wrapper', 'in-band body contract'). The other bullets (self-improvement-map, feature-guide, wire-protocol) each lead with a one-line plain-language hook. The density is defensible — it signposts the 'do not build a wrapper' guardrail at the nav layer — but a reader scanning the list for a quick orientation may stumble. No correctness impact; consider trimming to a hook + deferred detail if this list is meant for skim readers.

🟡 LOW runEvalCampaign return type described as CampaignResult but returns EvalCampaignResult — docs/eval-surface-map.md

Line 17 says runEvalCampaign returns 'CampaignResult + records', using backtick-wrapped CampaignResult (a concrete type). The actual return type is EvalCampaignResult (src/eval-campaign.ts:303-305), which (eval-campaign.ts:274-288) contains runs: RunRecord[], integrityReports: RunIntegrityReport[], failedRuns: FailedRun[], and optional report: ResearchReport. There is no CampaignResult field anywhere in EvalCampaignResult. Fix: change the table cell to 'EvalCampaignResult' or 'RunRecord[] + integrity reports' to match the actual type.

🟡 LOW verifyCompletion parameter names differ from source — docs/eval-surface-map.md

Line 36 pipeline diagram shows verifyCompletion(taskGold, state, correctnessChecker). Actual signature (src/completion-verifier.ts:259) uses gold (not taskGold) and checkCorrectness (not correctnessChecker). Cosmetic — no semantic difference — but the doc being wrong about the parameter names undermines its authority when a reader copy-pastes from it.


tangletools · 2026-06-14T11:37:47Z · trace

@drewstone drewstone merged commit 509c9fb into main Jun 14, 2026
1 check passed
drewstone added a commit that referenced this pull request Jun 14, 2026
…e map

Releases the merged-but-unpublished work:
- content-aware + ungameable produced-state completion grader (#248): require
  a body for a structural proposal match, strip deliverable-FORM vocabulary
  from requirement-side recall, anti-game fixtures.
- eval surface map doc (#249): pick-by-use-when for run* + the in-band
  produced-state contract; retires the phantom persona-dispatch wrapper.

Stricter verifyCompletion behavior — a notable minor; consumers adopt by bump.
drewstone added a commit that referenced this pull request Jun 14, 2026
…e map (#250)

Releases the merged-but-unpublished work:
- content-aware + ungameable produced-state completion grader (#248): require
  a body for a structural proposal match, strip deliverable-FORM vocabulary
  from requirement-side recall, anti-game fixtures.
- eval surface map doc (#249): pick-by-use-when for run* + the in-band
  produced-state contract; retires the phantom persona-dispatch wrapper.

Stricter verifyCompletion behavior — a notable minor; consumers adopt by bump.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants