Fix result sidecar identity for multi-target runs #1557

Description

@christso

Problem

AgentV multi-target runs currently preserve target identity in index.jsonl metadata, but they can write different targets for the same test_id to the same per-case sidecar directory. A dogfood run recorded both mock-alpha and mock-beta pointing at case-one/run-1/*; the later target overwrote the earlier target's output, grading, timing, metrics, and case summary artifacts.

This is tracked in Beads as av-9vi.

Chosen design

AgentV should remain a superset of Vercel-style experiment naming, not hard-code Next/Vercel semantics.

Keep experiment as the campaign namespace in eval YAML and artifacts. Keep the timestamp as the invocation/batch folder. A multi-target CLI command should fan out into separate target/variant run bundles under the same timestamp rather than forcing all targets into one shared index.

Canonical layout:

.agentv/results/<experiment>/<timestamp>/<target>/<variant?>/
  index.jsonl
  summary.json
  <row_id>/run-1/
  <row_id>/run-2/

The target/variant path is for storage isolation and manual browsing only. Dashboard and readers may walk folders to discover nested index.jsonl files, but target, variant, eval path, suite, test ID, sidecar paths, and display metadata must come from the manifest rows/run metadata rather than from folder-name parsing.
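
The discovery rule above could be sketched roughly as follows: walk the timestamp folder to find every index.jsonl, then read target and other semantics from the rows themselves. `findIndexFiles` and `readRows` are hypothetical helpers, and the `target` row field name is an assumption; only `test_id` is named in this issue.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Recursively collect every index.jsonl under dir. This also picks up a
// legacy root-level index.jsonl that sits directly in the timestamp folder.
function findIndexFiles(dir: string): string[] {
  const found: string[] = [];
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      found.push(...findIndexFiles(full));
    } else if (entry.isFile() && entry.name === "index.jsonl") {
      found.push(full);
    }
  }
  return found.sort();
}

// Parse one JSON object per non-empty line. Callers read target, variant,
// and similar fields from these rows, never from the folder names.
function readRows(indexPath: string): Record<string, unknown>[] {
  return fs
    .readFileSync(indexPath, "utf8")
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as Record<string, unknown>);
}
```

Folder names only locate the files here; nothing in the sketch splits or interprets a path segment as a target or variant.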

row_id should be stable, compact, and filesystem-safe, preferably:

<safe_test_id>--<short_hash>

The hash input should include collision-prone row fields available inside that target/variant bundle, at minimum eval path/source eval identity, suite label, and test ID. Include target and variant when present in row metadata for compatibility. There should be no rows/ parent directory unless implementation discovers a concrete need. Repeated attempts should remain run-1, run-2, etc.; do not add a new attempts concept.
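
One possible shape for the row_id derivation, as a sketch only. The `RowIdentity` fields, the `safeSegment` sanitizer, SHA-256, the 8-character hash length, and the 48-character name cap are all assumptions chosen for illustration; the issue does not prescribe any of them.

```typescript
import { createHash } from "node:crypto";

// Hypothetical row identity: the collision-prone fields named above.
interface RowIdentity {
  evalPath: string;
  suite: string;
  testId: string;
  target?: string;
  variant?: string;
}

// Replace anything outside a conservative character set and cap the length
// so the directory name stays compact and filesystem-safe.
function safeSegment(value: string): string {
  const cleaned = value.replace(/[^A-Za-z0-9._-]+/g, "-").replace(/^-+|-+$/g, "");
  return cleaned.slice(0, 48) || "row";
}

// <safe_test_id>--<short_hash>. The hash covers the full identity, so two
// rows with the same test ID but different suites or eval paths diverge.
function rowId(row: RowIdentity): string {
  // JSON array encoding keeps field boundaries unambiguous in the hash input.
  const canonical = JSON.stringify([
    row.evalPath,
    row.suite,
    row.testId,
    row.target ?? null,
    row.variant ?? null,
  ]);
  const shortHash = createHash("sha256").update(canonical).digest("hex").slice(0, 8);
  return `${safeSegment(row.testId)}--${shortHash}`;
}
```

The same inputs always produce the same directory name, which keeps reruns and resume logic pointed at the same sidecar folder.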

Scope

  • Keep .agentv/results/<experiment>/ as the comparison campaign namespace.
  • Keep experiment: naming in eval YAML; do not rename it to run_group.
  • Write one nested run bundle per target/variant under the timestamp.
  • Preserve legacy root-level index.jsonl readability.
  • Update Dashboard/result discovery to find nested bundle indexes under timestamp folders.
  • Existing readers should follow index.jsonl path fields. Any reader that infers <case>/run-N from the directory layout should be fixed to read the recorded path instead.
  • Do not implement migration or special parsing for a hypothetical experiment--target layout; AgentV does not currently support it.
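
As an illustration of "follow index.jsonl path fields", a reader might resolve sidecars like this. It is a sketch under two assumptions that the issue does not confirm: that path fields are stored relative to the directory containing index.jsonl, and that a field such as `grading_path` exists (the name is hypothetical). `resolveSidecar` is likewise an illustrative helper.

```typescript
import * as path from "node:path";

type IndexRow = Record<string, unknown>;

// Resolve a sidecar location from the path recorded in the row.
// Assumption: recorded paths are relative to the index.jsonl directory.
function resolveSidecar(indexPath: string, row: IndexRow, field: string): string {
  const recorded = row[field];
  if (typeof recorded !== "string" || recorded === "") {
    // Fail loudly instead of guessing a <case>/run-N location.
    throw new Error(`index row has no "${field}" path field`);
  }
  return path.posix.join(path.posix.dirname(indexPath), recorded);
}
```

Throwing on a missing field, rather than falling back to an inferred directory, surfaces any row whose sidecar location was never recorded.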

Likely code surface

  • packages/core/src/evaluation/run-artifacts.ts, especially sidecar/result directory allocation.
  • CLI eval artifact writer and run aggregation/resume identity.
  • Result index/summary path fields for grading, metrics, outputs, transcripts, timing, and case summaries.
  • Dashboard/result readers that resolve sidecar paths and compare run bundles.

Acceptance criteria

  • A multi-target CLI run writes separate nested target/variant bundles and no longer overwrites per-target artifacts.
  • Each target/variant bundle has its own index.jsonl, summary.json, and deterministic row sidecar directories.
  • Legacy root-level index.jsonl bundles remain readable.
  • Dashboard discovers nested index.jsonl files under timestamp folders.
  • Dashboard can list, open, drill into, and compare target/variant bundles under the same experiment.
  • Dashboard uses row/run metadata for target, variant, eval path, suite, test ID, and sidecar semantics; it must not derive semantics by parsing target/variant folder names.
  • Tests cover multi-target fan-out/isolation, same test_id across two suites, duplicate suite labels from different eval paths, variant when existing plumbing exposes variant, and metadata-not-path target/variant behavior.
  • Remote-sync dogfood verifies nested bundles pushed to agentv/results/v1 can be loaded after clearing local materialized results/cache, including fresh branch creation.
  • Browser UAT verifies a project with nested target/variant bundles loads in Dashboard and compare/drilldown works.
  • Live eval dogfood with a real provider and real LLM grader is run before ready-for-review, or any blocker is explicitly recorded with commands/evidence.
  • ADR/docs clarify that AgentV supports Vercel-style experiment naming and structured target/variant dimensions as a superset.

Status
In progress
