Fix result sidecar identity for multi-target runs #1557

## Problem

AgentV multi-target runs currently preserve target identity in `index.jsonl` metadata, but they can write artifacts from different targets for the same `test_id` to the same per-case sidecar directory. A dogfood run recorded both `mock-alpha` and `mock-beta` pointing at `case-one/run-1/*`; the later target overwrote the earlier target's output, grading, timing, metrics, and case summary artifacts.

This is tracked in Beads as `av-9vi`.

## Chosen design

AgentV should remain a superset of Vercel-style experiment naming, not hard-code Next/Vercel semantics.

Keep `experiment` as the campaign namespace in eval YAML and artifacts. Keep the timestamp as the invocation/batch folder. A multi-target CLI command should fan out into separate target/variant run bundles under the same timestamp rather than forcing all targets into one shared index.

Canonical layout:

```text
.agentv/results/<experiment>/<timestamp>/<target>/<variant?>/
  index.jsonl
  summary.json
  <row_id>/run-1/
  <row_id>/run-2/
```
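As a rough illustration of the layout above, the bundle directory could be assembled as follows. This is a sketch only: `BundleLocation`, `safeSegment`, and `bundleDir` are hypothetical names, and the sanitization rule is an assumption, not existing AgentV behavior.

```typescript
import * as path from "node:path";

// Hypothetical description of one target/variant bundle.
interface BundleLocation {
  resultsRoot: string; // e.g. ".agentv/results"
  experiment: string;
  timestamp: string;
  target: string;
  variant?: string;
}

// Assumed sanitization: collapse anything outside [A-Za-z0-9._-] to "-".
function safeSegment(value: string): string {
  const cleaned = value
    .replace(/[^A-Za-z0-9._-]+/g, "-")
    .replace(/^-+|-+$/g, "");
  // Never emit an empty segment or a relative-path segment.
  return cleaned.length > 0 && cleaned !== "." && cleaned !== ".." ? cleaned : "_";
}

// Builds <resultsRoot>/<experiment>/<timestamp>/<target>/<variant?>.
function bundleDir(loc: BundleLocation): string {
  const segments = [
    loc.resultsRoot,
    safeSegment(loc.experiment),
    safeSegment(loc.timestamp),
    safeSegment(loc.target),
  ];
  if (loc.variant) {
    segments.push(safeSegment(loc.variant));
  }
  // POSIX joins keep the example output identical on every platform.
  return path.posix.join(...segments);
}
```

Because the folder name is lossy after sanitization, it is only suitable for storage isolation; the real target and variant values still have to be recorded in the manifest rows.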

The target/variant path is for storage isolation and manual browsing only. Dashboard and readers may walk folders to discover nested `index.jsonl` files, but target, variant, eval path, suite, test ID, sidecar paths, and display metadata must come from the manifest rows/run metadata rather than from folder-name parsing.
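A minimal sketch of that discovery rule, assuming each manifest row is one JSON object per line with `target`, `variant`, and `test_id` fields (the function names and the exact row shape are illustrative assumptions, not the current reader API):

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Assumed subset of a manifest row; real rows carry more fields.
interface RowMeta {
  test_id?: string;
  target?: string;
  variant?: string;
}

// Walks a timestamp folder and returns every bundle's index.jsonl.
// A directory that holds index.jsonl is treated as a bundle root and is
// not descended into, so row sidecar folders are never scanned.
function findIndexFiles(timestampDir: string): string[] {
  const found: string[] = [];
  const walk = (dir: string): void => {
    const entries = fs.readdirSync(dir, { withFileTypes: true });
    if (entries.some((e) => e.isFile() && e.name === "index.jsonl")) {
      found.push(path.join(dir, "index.jsonl"));
      return;
    }
    for (const entry of entries) {
      if (entry.isDirectory()) {
        walk(path.join(dir, entry.name));
      }
    }
  };
  walk(timestampDir);
  return found.sort();
}

// Reads rows from a manifest; identity comes from here, not from the path.
function readRows(indexFile: string): RowMeta[] {
  return fs
    .readFileSync(indexFile, "utf8")
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as RowMeta);
}
```

The same walk also picks up a legacy root-level `index.jsonl`, since the timestamp folder itself is then the bundle root.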

`row_id` should be stable, compact, and filesystem-safe, preferably:

```text
<safe_test_id>--<short_hash>
```

The hash input should include collision-prone row fields available inside that target/variant bundle, at minimum eval path/source eval identity, suite label, and test ID. Include target and variant when present in row metadata for compatibility. There should be no `rows/` parent directory unless implementation discovers a concrete need. Repeated attempts should remain `run-1`, `run-2`, etc.; do not add a new `attempts` concept.
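One possible way to derive such a `row_id` is sketched below. The `RowIdentity` shape, the SHA-256 choice, the 8-character hash length, and the 48-character cap on the readable prefix are all assumptions made for illustration; the issue does not prescribe them.

```typescript
import { createHash } from "node:crypto";

// Hypothetical identity fields for one row inside a target/variant bundle.
interface RowIdentity {
  evalPath: string;
  suite: string;
  testId: string;
  target?: string;
  variant?: string;
}

// Produces "<safe_test_id>--<short_hash>".
function rowId(row: RowIdentity): string {
  // Readable, filesystem-safe prefix derived from the test ID.
  const safeTestId =
    row.testId
      .replace(/[^A-Za-z0-9._-]+/g, "-")
      .replace(/^-+|-+$/g, "")
      .slice(0, 48) || "row";

  // A JSON array keeps field boundaries unambiguous in the hash input.
  const hashInput = JSON.stringify([
    row.evalPath,
    row.suite,
    row.testId,
    row.target ?? null,
    row.variant ?? null,
  ]);
  const shortHash = createHash("sha256")
    .update(hashInput)
    .digest("hex")
    .slice(0, 8);

  return `${safeTestId}--${shortHash}`;
}
```

With this shape, the same `test_id` in two suites, or the same suite label under two eval paths, yields different directories, while repeated attempts of one row keep sharing the directory and differ only by `run-N`.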

## Scope

- Keep `.agentv/results/<experiment>/` as the comparison campaign namespace.
- Keep `experiment:` naming in eval YAML; do not rename it to `run_group`.
- Write one nested run bundle per target/variant under the timestamp.
- Preserve legacy root-level `index.jsonl` readability.
- Update Dashboard/result discovery to find nested bundle indexes under timestamp folders.
- Existing readers should follow `index.jsonl` path fields. Any reader that infers `<case>/run-N` should be fixed.
- Do not implement migration or special parsing for hypothetical `experiment--target`; AgentV does not currently support that layout.
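The reader rule in the list above (follow `index.jsonl` path fields instead of inferring `<case>/run-N`) might look like the sketch below. It assumes path fields are stored relative to the bundle's `index.jsonl`; `resolveSidecar` and the `grading_path` field name are hypothetical.

```typescript
import * as path from "node:path";

// Resolves a sidecar path recorded in a manifest row.
// Assumption: the value is relative to the directory holding index.jsonl.
function resolveSidecar(
  indexFile: string,
  row: Record<string, unknown>,
  field: string,
): string | undefined {
  const value = row[field];
  if (typeof value !== "string" || value === "") {
    // No recorded path: report "missing" rather than guessing a location.
    return undefined;
  }
  const bundleRoot = path.posix.dirname(indexFile);
  return path.posix.join(bundleRoot, value);
}
```

Returning `undefined` for a missing field keeps the reader from silently reconstructing a `<case>/run-N` path, which is the inference this issue asks to remove.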

## Likely code surface

- `packages/core/src/evaluation/run-artifacts.ts`, especially sidecar/result directory allocation.
- CLI eval artifact writer and run aggregation/resume identity.
- Result index/summary path fields for grading, metrics, outputs, transcripts, timing, and case summaries.
- Dashboard/result readers that resolve sidecar paths and compare run bundles.

## Acceptance criteria

- A multi-target CLI run writes separate nested target/variant bundles and no longer overwrites per-target artifacts.
- Each target/variant bundle has its own `index.jsonl`, `summary.json`, and deterministic row sidecar directories.
- Legacy root-level `index.jsonl` bundles remain readable.
- Dashboard discovers nested `index.jsonl` files under timestamp folders.
- Dashboard can list, open, drill into, and compare target/variant bundles under the same experiment.
- Dashboard uses row/run metadata for target, variant, eval path, suite, test ID, and sidecar semantics; it must not derive semantics by parsing target/variant folder names.
- Tests cover multi-target fan-out/isolation, same `test_id` across two suites, duplicate suite labels from different eval paths, variant when existing plumbing exposes variant, and metadata-not-path target/variant behavior.
- Remote-sync dogfood verifies nested bundles pushed to `agentv/results/v1` can be loaded after clearing local materialized results/cache, including fresh branch creation.
- Browser UAT verifies a project with nested target/variant bundles loads in Dashboard and compare/drilldown works.
- Live eval dogfood with a real provider and real LLM grader is run before ready-for-review, or any blocker is explicitly recorded with commands/evidence.
- ADR/docs clarify that AgentV supports Vercel-style experiment naming and structured target/variant dimensions as a superset.

## References

- Bead: `av-9vi`
- ADR PR: https://github.com/EntityProcess/agentv/pull/1556
- Research beads: `av-770`, `av-e49`, `av-74h`

