Problem
AgentV multi-target runs currently preserve target identity in index.jsonl metadata but can write different targets for the same test_id to the same per-case sidecar directory. Dogfood recorded both mock-alpha and mock-beta pointing at case-one/run-1/*; the later target overwrote the earlier target's output, grading, timing, metrics, and case summary artifacts.
This is tracked in Beads as av-9vi.
Chosen design
AgentV should remain a superset of Vercel-style experiment naming, not hard-code Next/Vercel semantics.
Keep experiment as the campaign namespace in eval YAML and artifacts. Keep the timestamp as the invocation/batch folder. A multi-target CLI command should fan out into separate target/variant run bundles under the same timestamp rather than forcing all targets into one shared index.
Canonical layout:
.agentv/results/<experiment>/<timestamp>/<target>/<variant?>/
  index.jsonl
  summary.json
  <row_id>/run-1/
  <row_id>/run-2/
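As a rough illustration of the layout above, the sketch below builds one bundle directory per target/variant under a shared timestamp. The `BundleKey` type and the `bundleDir` and `sidecarDir` helpers are hypothetical names invented for this example and are not taken from run-artifacts.ts. It also assumes target and variant names are already filesystem-safe.

```typescript
import * as path from "node:path";

// Hypothetical key for one run bundle; not AgentV's actual type.
interface BundleKey {
  experiment: string;
  timestamp: string;
  target: string;
  variant?: string; // the variant segment is omitted when absent
}

// <resultsRoot>/<experiment>/<timestamp>/<target>/<variant?>
function bundleDir(resultsRoot: string, key: BundleKey): string {
  const segments = [resultsRoot, key.experiment, key.timestamp, key.target];
  if (key.variant) segments.push(key.variant);
  return path.posix.join(...segments);
}

// <bundle>/<row_id>/run-N, with run numbers starting at 1.
function sidecarDir(bundle: string, rowId: string, run: number): string {
  return path.posix.join(bundle, rowId, `run-${run}`);
}
```

Because the target is part of the bundle path, two targets that share a row ID resolve to different sidecar directories, which is the isolation the overwrite bug lacked.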
The target/variant path is for storage isolation and manual browsing only. Dashboard and readers may walk folders to discover nested index.jsonl files, but target, variant, eval path, suite, test ID, sidecar paths, and display metadata must come from the manifest rows/run metadata rather than from folder-name parsing.
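A minimal sketch of that split between discovery and semantics, assuming each index.jsonl row carries target and variant fields. The `IndexRow` shape and the `findIndexFiles` and `describeBundle` helpers are hypothetical names for illustration only. The walk only locates files; everything shown to the user is read from the rows.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical row shape; real index.jsonl rows may use different field names.
interface IndexRow {
  test_id: string;
  target?: string;
  variant?: string;
}

interface DiscoveredBundle {
  indexPath: string;
  targets: string[];
  variants: string[];
}

// Recursively collect every index.jsonl under a timestamp folder.
// A legacy root-level index.jsonl is found by the same walk.
function findIndexFiles(dir: string): string[] {
  const found: string[] = [];
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) found.push(...findIndexFiles(full));
    else if (entry.isFile() && entry.name === "index.jsonl") found.push(full);
  }
  return found.sort();
}

// Target and variant are read from the rows, never parsed from folder names.
function describeBundle(indexPath: string): DiscoveredBundle {
  const rows = fs
    .readFileSync(indexPath, "utf8")
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as IndexRow);
  const unique = (values: (string | undefined)[]): string[] =>
    Array.from(new Set(values.filter((v): v is string => v !== undefined))).sort();
  return {
    indexPath,
    targets: unique(rows.map((r) => r.target)),
    variants: unique(rows.map((r) => r.variant)),
  };
}
```

With this shape, renaming a bundle folder by hand would not change what a reader reports, since the folder name is never interpreted.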
row_id should be stable, compact, and filesystem-safe, preferably:
<safe_test_id>--<short_hash>
The hash input should include collision-prone row fields available inside that target/variant bundle, at minimum eval path/source eval identity, suite label, and test ID. Include target and variant when present in row metadata for compatibility. There should be no rows/ parent directory unless implementation discovers a concrete need. Repeated attempts should remain run-1, run-2, etc.; do not add a new attempts concept.
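One way the row ID could be derived is sketched below. The `RowIdentity` field names, the 8-character SHA-256 prefix, the 48-character cap, and the `"row"` fallback are all assumptions made for illustration; the design above does not fix a hash algorithm or length.

```typescript
import { createHash } from "node:crypto";

// Hypothetical identity fields; names are assumptions, not AgentV's row schema.
interface RowIdentity {
  evalPath: string;
  suite: string;
  testId: string;
  target?: string;
  variant?: string;
}

// Keep only [A-Za-z0-9._-] so the result is usable as a directory name,
// and strip leading/trailing dots and dashes so ".." can never survive.
function safeSegment(value: string, maxLength = 48): string {
  const cleaned = value
    .replace(/[^A-Za-z0-9._-]+/g, "-")
    .replace(/^[-.]+/, "")
    .slice(0, maxLength)
    .replace(/[-.]+$/, "");
  return cleaned || "row";
}

function rowId(row: RowIdentity): string {
  // JSON-encode an ordered array so field boundaries stay unambiguous.
  const hashInput = JSON.stringify([
    row.evalPath,
    row.suite,
    row.testId,
    row.target ?? null,
    row.variant ?? null,
  ]);
  const shortHash = createHash("sha256").update(hashInput).digest("hex").slice(0, 8);
  return `${safeSegment(row.testId)}--${shortHash}`;
}
```

The readable prefix keeps manual browsing practical, while the hash separates rows that share a test ID across suites or eval paths.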
Scope
- Keep .agentv/results/<experiment>/ as the comparison campaign namespace.
- Keep experiment: naming in eval YAML; do not rename it to run_group.
- Write one nested run bundle per target/variant under the timestamp.
- Preserve legacy root-level index.jsonl readability.
- Update Dashboard/result discovery to find nested bundle indexes under timestamp folders.
- Existing readers should follow index.jsonl path fields. Any reader that infers <case>/run-N should be fixed.
- Do not implement migration or special parsing for hypothetical experiment--target; AgentV does not currently support that layout.
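To illustrate the reader rule in this list, the sketch below resolves a sidecar file from a path recorded in the row instead of rebuilding <case>/run-N from the test ID. The field names `grading_path` and `output_path`, the file names in the usage, and the assumption that recorded paths are relative to the directory holding index.jsonl are all hypothetical; the real index schema may differ.

```typescript
import * as path from "node:path";

// Hypothetical row shape; actual path field names in index.jsonl may differ.
interface IndexRowPaths {
  test_id: string;
  grading_path?: string;
  output_path?: string;
}

type PathField = "grading_path" | "output_path";

// Resolve a recorded path against the folder containing index.jsonl.
// Returns undefined when the row does not record that artifact.
function resolveSidecar(
  indexPath: string,
  row: IndexRowPaths,
  field: PathField,
): string | undefined {
  const recorded = row[field];
  if (recorded === undefined) return undefined;
  return path.posix.join(path.posix.dirname(indexPath), recorded);
}
```

Under these assumptions the same function serves a nested bundle and a legacy root-level index, because only the location of index.jsonl and the recorded path matter.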
Likely code surface
- packages/core/src/evaluation/run-artifacts.ts, especially sidecar/result directory allocation.
- CLI eval artifact writer and run aggregation/resume identity.
- Result index/summary path fields for grading, metrics, outputs, transcripts, timing, and case summaries.
- Dashboard/result readers that resolve sidecar paths and compare run bundles.
Acceptance criteria
- A multi-target CLI run writes separate nested target/variant bundles and no longer overwrites per-target artifacts.
- Each target/variant bundle has its own index.jsonl, summary.json, and deterministic row sidecar directories.
- Legacy root-level index.jsonl bundles remain readable.
- Dashboard discovers nested index.jsonl files under timestamp folders.
- Dashboard can list, open, drill into, and compare target/variant bundles under the same experiment.
- Dashboard uses row/run metadata for target, variant, eval path, suite, test ID, and sidecar semantics; it must not derive semantics by parsing target/variant folder names.
- Tests cover multi-target fan-out/isolation, same test_id across two suites, duplicate suite labels from different eval paths, variant when existing plumbing exposes variant, and metadata-not-path target/variant behavior.
- Remote-sync dogfood verifies nested bundles pushed to agentv/results/v1 can be loaded after clearing local materialized results/cache, including fresh branch creation.
- Browser UAT verifies a project with nested target/variant bundles loads in Dashboard and compare/drilldown works.
- Live eval dogfood with a real provider and real LLM grader is run before ready-for-review, or any blocker is explicitly recorded with commands/evidence.
- ADR/docs clarify that AgentV supports Vercel-style experiment naming and structured target/variant dimensions as a superset.
References
- av-9vi
- av-770, av-e49, av-74h