From 8890e3622b7a3eceafe1b8698ddce2907369d4ce Mon Sep 17 00:00:00 2001
From: Christopher Tso <christso@gmail.com>
Date: Mon, 29 Jun 2026 01:14:30 +0200
Subject: [PATCH 1/3] docs(adr): clarify result row sidecar identity

---
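Reviewer note (below the `---` marker, so not part of the commit message):
the `row_id` allocation that ADR 0009 describes in this patch could be
sketched roughly as follows. This is a hypothetical Python illustration, not
AgentV's implementation. The names `safe_segment` and `allocate_row_id`, the
sanitization rule, the NUL field separator, the choice of SHA-256, and the
8-character hash length are all assumptions made for the example. Only the
`<safe_test_id>--<short_hash>` shape and the list of hash inputs come from the
ADR text.

```python
import hashlib
import re


def safe_segment(value: str, max_len: int = 40) -> str:
    """Reduce a test_id to a readable, filesystem-safe prefix (assumed rule)."""
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "-", value)[:max_len].strip("-.")
    return cleaned or "row"


def allocate_row_id(
    eval_path: str,
    suite: str,
    test_id: str,
    target: str,
    variant: str = "",
    hash_len: int = 8,
) -> str:
    """Build `<safe_test_id>--<short_hash>` from the collision-prone row fields."""
    fields = [eval_path, suite, test_id, target, variant]
    # Assumed encoding: join with NUL so ("a", "bc") and ("ab", "c") differ.
    digest = hashlib.sha256("\x00".join(fields).encode("utf-8")).hexdigest()
    # The visible prefix is a convenience; uniqueness comes from the hash.
    return f"{safe_segment(test_id)}--{digest[:hash_len]}"
```

Because `eval_path` is part of the hash input, two eval files that reuse the
same `test_id` and suite label still get distinct directories, while the
readable prefix stays the same.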
 ...arate-experiments-from-eval-definitions.md | 27 ++++---
 ...-result-identity-and-default-experiment.md | 75 +++++++++++++++++--
 2 files changed, 84 insertions(+), 18 deletions(-)

diff --git a/docs/adr/0006-separate-experiments-from-eval-definitions.md b/docs/adr/0006-separate-experiments-from-eval-definitions.md
index e4ad436ad..c5b176469 100644
--- a/docs/adr/0006-separate-experiments-from-eval-definitions.md
+++ b/docs/adr/0006-separate-experiments-from-eval-definitions.md
@@ -371,28 +371,31 @@ This is the motivating distinction:
 
 ## Result Layout
 
-The canonical writer path is:
+ADR 0009 is the source of truth for result experiment bucket precedence, row
+identity, and sidecar path allocation. The canonical run bundle root is:
 
 ```text
-.agentv/results/<eval-name>/<timestamp>/...
+.agentv/results/<experiment>/<timestamp>/...
 ```
 
 There is no `.agentv/results/runs` segment in canonical writer output. There is
-also no default nested suite segment when the result group is already the same
-eval suite being run directly.
+also no schema-significant suite or imported-suite directory segment.
 
-If a wrapper eval imports another suite with `type: suite`, test artifacts from
-that imported suite are nested under the imported suite identity:
+Within the timestamp, `index.jsonl` is authoritative for row identity and all
+run-relative sidecar paths. New writers should allocate deterministic row
+directories directly under the timestamp:
 
 ```text
-.agentv/results/<wrapper-eval-name>/<timestamp>/<imported-suite-name>/<test-id>/...
+.agentv/results/<experiment>/<timestamp>/<row_id>/run-1/...
 ```
 
-The suite segment is required for imported suites because wrapper evals can
-compose many suites with overlapping test IDs, and the directory tree should
-remain inspectable without reading every manifest row. Test artifacts from tests
-owned directly by the wrapper eval can still live directly under `<test-id>`.
-All cases should also retain source suite metadata in manifests and index rows.
+`row_id` is a stable, filesystem-safe allocation such as
+`<safe_test_id>--<short_hash>`. The hash input includes the source eval identity
+or `eval_path`, suite label, `test_id`, target, and variant. This keeps
+same-`test_id` rows from different suites, duplicate suite labels, targets, and
+variants from overwriting each other without making path hierarchy a semantic
+contract. There is no required `rows/` parent directory. Source suite metadata
+still belongs in manifests and index rows.
 
 The result namespace remains `experiment` in artifacts and Dashboard. AgentV
 should not introduce a separate authored `run_group` field. For better DX,
diff --git a/docs/adr/0009-eval-path-result-identity-and-default-experiment.md b/docs/adr/0009-eval-path-result-identity-and-default-experiment.md
index c48fb0702..f5deb1fb7 100644
--- a/docs/adr/0009-eval-path-result-identity-and-default-experiment.md
+++ b/docs/adr/0009-eval-path-result-identity-and-default-experiment.md
@@ -23,6 +23,16 @@ or suite names. That creates unstable routing when a single CLI invocation runs
 multiple eval files, imports suites with overlapping case IDs, or changes
 display metadata without changing the task under evaluation.
 
+Follow-up dogfood in bead `av-770` found a concrete bug in that direction:
+multi-target AgentV runs stored different targets for the same `test_id` under
+the same `<case>/run-N` sidecar directory. The second target overwrote the
+first target's output, grading, timing, metrics, and case summary artifacts.
+Related research in beads `av-74h` and `av-e49` compared Vercel `agent-eval`,
+Vercel `next-evals-oss`, and Margin Evals. Those systems confirm that
+frameworks can either encode model/variant in experiment names or keep run
+manifests as the source of truth, but none requires AgentV to derive artifact
+paths from suite names.
+
 The final contract keeps authoring and storage separate:
 
 - `eval.yaml` remains the authored experiment spec.
@@ -30,6 +40,8 @@ The final contract keeps authoring and storage separate:
 - per-row source identity is stored in `index.jsonl`;
 - `suite` and `name` remain display metadata only;
 - path discovery comes from the run manifest, not from folder conventions.
+- per-row sidecar directories are stable storage allocations, not semantic
+  routing keys.
 
 ## Decision
 
@@ -55,13 +67,13 @@ suite names, numbers of input eval files, or multi-eval wrapper shapes.
 Each row in `index.jsonl` is identified by:
 
 ```text
-eval_path + test_id + target
+eval_path + test_id + target + variant
 ```
 
 `eval_path` is the source eval file path relative to the repo root or run
 source root. Dashboard and other readers should display this value as `Eval`.
-They should also display `test_id` and `target` so users can distinguish rows
-with overlapping test IDs.
+They should also display `test_id`, `target`, and `variant` when present so
+users can distinguish rows with overlapping test IDs.
 
 `suite` and `name` are display metadata. They may help humans group or label
 results, but they must not drive storage, routing, Dashboard detail selection,
@@ -74,9 +86,55 @@ directories are exposed with `result_dir`. Sidecar paths such as `task_dir`,
 Consumers must use these fields instead of reconstructing paths from
 `suite`, `name`, `test_id`, or `target`.
 
-`result_dir` is an opaque run-local allocation. It should stay readable when
-that does not compromise uniqueness, but implementations may suffix or allocate
-otherwise to avoid collisions. Its value is not the public identity of the row.
+`result_dir` is an opaque run-local allocation. For newly written artifacts, the
+preferred allocation is a deterministic row directory directly under the
+timestamp:
+
+```text
+.agentv/results/<experiment>/<timestamp>/
+  index.jsonl
+  summary.json
+  <row_id>/run-1/
+  <row_id>/run-2/
+```
+
+There is no required `rows/` parent directory. `row_id` should be stable,
+filesystem-safe, compact, and readable enough for humans to scan, for example:
+
+```text
+<safe_test_id>--<short_hash>
+```
+
+The visible `test_id` prefix is only a convenience. The hash input must include
+the collision-prone row fields available at write time: `eval_path` or source
+eval identity, `suite` label, `test_id`, `target`, and `variant`. `eval_path`
+or equivalent source identity is what prevents duplicate suite names from
+colliding; the suite label alone is never a uniqueness boundary. If future row
+identity gains another axis, that axis must be included in the hash before it
+can affect sidecar allocation.
+
+This row-id allocation is intentionally simpler than conditional path
+disambiguation. It avoids special cases for same `test_id` across suites,
+duplicate suite labels, multi-target runs, and target variants. Existing run
+bundles remain readable because `index.jsonl` already records explicit
+run-relative paths; any consumer that infers `<case>/run-N` paths instead of
+following `index.jsonl` is depending on an implementation detail and should be
+fixed.
+
+Reference alternatives considered:
+
+- Vercel `agent-eval` expands model arrays into experiment paths such as
+  `<config>/<model>/<timestamp>/<case>/run-N`. This works for
+  model-as-experiment publication but fragments one multi-target invocation and
+  lets provider names with slashes become path hierarchy.
+- Vercel `next-evals-oss` uses one experiment file per model and `--agents-md`
+  variant, then pairs variants during export. AgentV should allow that style by
+  experiment naming for published baselines, but not require it for ordinary
+  multi-target runs.
+- Margin Evals writes one output run directory with result manifests and
+  instance artifacts, without an AgentV-style experiment bucket. That validates
+  manifest-first storage, but dropping AgentV's experiment bucket is a larger
+  semantic change than this bug fix needs.
 
 ## Consequences
 
@@ -91,6 +149,10 @@ otherwise to avoid collisions. Its value is not the public identity of the row.
   long as their `eval_path` values differ.
 - Import, rerun, Dashboard, comparison, and export tools can load a run from
   `index.jsonl` without needing source checkout conventions.
+- Multi-target and variant runs do not need to become multiple experiments just
+  to avoid sidecar collisions.
+- New sidecar paths may not resemble the case hierarchy, which is acceptable
+  because `index.jsonl` is the contract for discovery and display.
 
 ## Non-Goals
 
@@ -99,4 +161,5 @@ otherwise to avoid collisions. Its value is not the public identity of the row.
 - Hashing eval paths into default experiment names.
 - Creating automatic `multi-eval` experiment names.
 - Making `result_dir` a semantic folder contract.
+- Adding a `rows/` directory segment without a concrete implementation need.
 - Removing compatibility readers for older run bundles in this ADR.

From 0456fa609b66c0f372b3b995150992301a7c9cc2 Mon Sep 17 00:00:00 2001
From: Christopher Tso <christso@gmail.com>
Date: Mon, 29 Jun 2026 08:43:30 +0200
Subject: [PATCH 2/3] docs(adr): align row identity with target bundles

---
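Reviewer note (below the `---` marker, so not part of the commit message): a
reader that follows the discovery rule in this patch might look roughly like
the sketch below. It is a hypothetical Python illustration under stated
assumptions, not AgentV's reader. The helper names `discover_bundles` and
`load_rows`, the one-JSON-object-per-line row shape, and the added
`result_dir_abs` key are invented for the example, and the sketch assumes
sidecar directories never contain their own `index.jsonl`. The field names
`result_dir`, `target`, and `variant` are taken from the ADR.

```python
import json
import pathlib


def discover_bundles(invocation_dir: pathlib.Path) -> list[pathlib.Path]:
    """Find every bundle directory under one timestamped invocation directory.

    rglob also matches an index.jsonl directly under invocation_dir, so legacy
    timestamp-level bundles are returned alongside target/variant bundles.
    """
    return sorted(p.parent for p in invocation_dir.rglob("index.jsonl"))


def load_rows(bundle_dir: pathlib.Path) -> list[dict]:
    """Read one bundle manifest and resolve each row's result directory."""
    rows = []
    with (bundle_dir / "index.jsonl").open(encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)
            # Paths are bundle-relative; target and variant come from the row,
            # never from the names of the folders that contain the bundle.
            row["result_dir_abs"] = bundle_dir / row["result_dir"]
            rows.append(row)
    return rows
```

The folder names are deliberately ignored here: a bundle stored under any
directory still reports the `target` and `variant` recorded in its rows.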
 ...arate-experiments-from-eval-definitions.md | 43 +++++++----
 ...-result-identity-and-default-experiment.md | 74 ++++++++++++-------
 2 files changed, 78 insertions(+), 39 deletions(-)

diff --git a/docs/adr/0006-separate-experiments-from-eval-definitions.md b/docs/adr/0006-separate-experiments-from-eval-definitions.md
index c5b176469..0cee8caea 100644
--- a/docs/adr/0006-separate-experiments-from-eval-definitions.md
+++ b/docs/adr/0006-separate-experiments-from-eval-definitions.md
@@ -31,7 +31,9 @@ The final design keeps the product boundary smaller:
 - `eval.yaml` is the only runnable authoring artifact.
 - `experiment:` is an inline run-time block inside `eval.yaml`.
 - `tests:` is the composition, import, and selection surface.
-- result bundles are written under `.agentv/results/<eval-name>/<timestamp>/`.
+- each invocation's results are written under
+  `.agentv/results/<experiment>/<timestamp>/`, with target/variant bundles below
+  the timestamp.
 - A directory named `experiments/` may be used as a user-owned repo convention
   for wrapper eval YAML files, but it does not create a separate experiment
   artifact type or schema-significant path.
@@ -371,8 +373,9 @@ This is the motivating distinction:
 
 ## Result Layout
 
-ADR 0009 is the source of truth for result experiment bucket precedence, row
-identity, and sidecar path allocation. The canonical run bundle root is:
+ADR 0009 is the source of truth for result experiment bucket precedence,
+target/variant bundle layout, row identity, and sidecar path allocation. The
+canonical invocation directory is:
 
 ```text
 .agentv/results/<experiment>/<timestamp>/...
@@ -381,21 +384,33 @@ identity, and sidecar path allocation. The canonical run bundle root is:
 There is no `.agentv/results/runs` segment in canonical writer output. There is
 also no schema-significant suite or imported-suite directory segment.
 
-Within the timestamp, `index.jsonl` is authoritative for row identity and all
-run-relative sidecar paths. New writers should allocate deterministic row
-directories directly under the timestamp:
+Within the timestamp, new writers fan out into one bundle per target and
+optional variant. Each bundle has its own `index.jsonl` and `summary.json`:
 
 ```text
-.agentv/results/<experiment>/<timestamp>/<row_id>/run-1/...
+.agentv/results/<experiment>/<timestamp>/<target>/<variant?>/
+  index.jsonl
+  summary.json
+  <row_id>/run-1/
+  <row_id>/run-2/
 ```
 
-`row_id` is a stable, filesystem-safe allocation such as
-`<safe_test_id>--<short_hash>`. The hash input includes the source eval identity
-or `eval_path`, suite label, `test_id`, target, and variant. This keeps
-same-`test_id` rows from different suites, duplicate suite labels, targets, and
-variants from overwriting each other without making path hierarchy a semantic
-contract. There is no required `rows/` parent directory. Source suite metadata
-still belongs in manifests and index rows.
+The target/variant folder split is for storage isolation and manual browsing
+only. Dashboard, import, rerun, comparison, and export readers discover nested
+`index.jsonl` files, then use bundle summary and row metadata for `target`,
+`variant`, `eval_path`, `suite`, and `test_id` semantics. Legacy timestamp-level
+bundles with `index.jsonl` directly under
+`.agentv/results/<experiment>/<timestamp>/` remain readable.
+
+Inside each target/variant bundle, `index.jsonl` is authoritative for row
+identity and all bundle-relative sidecar paths. `row_id` directories are stable,
+filesystem-safe allocations such as `<safe_test_id>--<short_hash>`, and repeated
+runs live under `run-1`, `run-2`, and so on. The hash input includes the source
+eval identity or `eval_path`, suite label, `test_id`, target, and variant. This
+keeps same-`test_id` rows from different suites, duplicate suite labels, targets,
+and variants from overwriting each other without making path hierarchy a
+semantic contract. There is no required `rows/` parent directory. Source suite
+metadata still belongs in manifests and index rows.
 
 The result namespace remains `experiment` in artifacts and Dashboard. AgentV
 should not introduce a separate authored `run_group` field. For better DX,
diff --git a/docs/adr/0009-eval-path-result-identity-and-default-experiment.md b/docs/adr/0009-eval-path-result-identity-and-default-experiment.md
index f5deb1fb7..c77baa0a8 100644
--- a/docs/adr/0009-eval-path-result-identity-and-default-experiment.md
+++ b/docs/adr/0009-eval-path-result-identity-and-default-experiment.md
@@ -14,8 +14,8 @@ composition.
 ## Context
 
 AgentV needs one simple result identity contract that works for direct eval
-runs, imported evals, repeated attempts, Dashboard inspection, and downstream
-tools that consume portable run bundles.
+runs, imported evals, repeated runs, Dashboard inspection, and downstream tools
+that consume portable run bundles.
 
 The previous same-week direction kept `eval.yaml` as the authored experiment
 spec, but it still let result buckets and per-case paths be inferred from eval
@@ -25,7 +25,7 @@ display metadata without changing the task under evaluation.
 
 Follow-up dogfood in bead `av-770` found a concrete bug in that direction:
 multi-target AgentV runs stored different targets for the same `test_id` under
-the same `<case>/run-N` sidecar directory. The second target overwrote the
+the same `<case>/run-1` sidecar directory. The second target overwrote the
 first target's output, grading, timing, metrics, and case summary artifacts.
 Related research in beads `av-74h` and `av-e49` compared Vercel `agent-eval`,
 Vercel `next-evals-oss`, and Margin Evals. Those systems confirm that
@@ -36,7 +36,9 @@ suite names.
 The final contract keeps authoring and storage separate:
 
 - `eval.yaml` remains the authored experiment spec.
-- a CLI invocation produces one timestamped run bundle;
+- a CLI invocation produces one timestamped invocation directory;
+- each target and optional variant in that invocation gets an isolated result
+  bundle under the timestamp;
 - per-row source identity is stored in `index.jsonl`;
 - `suite` and `name` remain display metadata only;
 - path discovery comes from the run manifest, not from folder conventions.
@@ -45,12 +47,31 @@ The final contract keeps authoring and storage separate:
 
 ## Decision
 
-One AgentV CLI invocation writes one run bundle under:
+One AgentV CLI invocation writes one timestamped invocation directory under:
 
 ```text
 .agentv/results/<experiment>/<timestamp>/
 ```
 
+`experiment` remains the campaign namespace. The timestamp is the
+invocation/batch folder for that CLI run. Within the timestamp, new writers fan
+out into one bundle per target and optional variant:
+
+```text
+.agentv/results/<experiment>/<timestamp>/<target>/<variant?>/
+  index.jsonl
+  summary.json
+  <row_id>/run-1/
+  <row_id>/run-2/
+```
+
+The `<target>/<variant?>` folder split exists only for storage isolation and
+manual browsing. It is not the semantic source for target or variant. Readers
+discover nested `index.jsonl` files, then use loaded summary and row metadata for
+`target`, `variant`, `eval_path`, `suite`, and `test_id` semantics. Legacy
+timestamp-level bundles that keep `index.jsonl` and `summary.json` directly
+under `.agentv/results/<experiment>/<timestamp>/` remain readable.
+
 The result experiment bucket is selected in this order:
 
 1. the explicit CLI `--experiment` value;
@@ -79,19 +100,19 @@ users can distinguish rows with overlapping test IDs.
 results, but they must not drive storage, routing, Dashboard detail selection,
 rerun lookup, import identity, or artifact discovery.
 
-`index.jsonl` is authoritative for all run-relative artifact paths. Per-row
+`index.jsonl` is authoritative for all bundle-relative artifact paths. Per-row
 directories are exposed with `result_dir`. Sidecar paths such as `task_dir`,
 `summary_path`, `grading_path`, `metrics_path`, `transcript_path`,
 `targets_path`, `files_path`, and `graders_path` are explicit manifest fields.
 Consumers must use these fields instead of reconstructing paths from
-`suite`, `name`, `test_id`, or `target`.
+`suite`, `name`, `test_id`, `target`, `variant`, or target/variant folder names.
 
-`result_dir` is an opaque run-local allocation. For newly written artifacts, the
-preferred allocation is a deterministic row directory directly under the
-timestamp:
+`result_dir` is an opaque bundle-local allocation. For newly written artifacts,
+the preferred allocation is a deterministic row directory inside the
+target/variant bundle:
 
 ```text
-.agentv/results/<experiment>/<timestamp>/
+.agentv/results/<experiment>/<timestamp>/<target>/<variant?>/
   index.jsonl
   summary.json
   <row_id>/run-1/
@@ -115,22 +136,24 @@ can affect sidecar allocation.
 
 This row-id allocation is intentionally simpler than conditional path
 disambiguation. It avoids special cases for same `test_id` across suites,
-duplicate suite labels, multi-target runs, and target variants. Existing run
-bundles remain readable because `index.jsonl` already records explicit
-run-relative paths; any consumer that infers `<case>/run-N` paths instead of
-following `index.jsonl` is depending on an implementation detail and should be
-fixed.
+duplicate suite labels, multi-target runs, and target variants while keeping the
+multi-target CLI under one timestamped invocation directory. Existing run
+bundles remain readable because `index.jsonl` already records explicit artifact
+paths; any consumer that infers `<case>/run-1` paths or semantic target
+information from folders instead of following `index.jsonl` is depending on an
+implementation detail and should be fixed.
 
 Reference alternatives considered:
 
 - Vercel `agent-eval` expands model arrays into experiment paths such as
-  `<config>/<model>/<timestamp>/<case>/run-N`. This works for
+  `<config>/<model>/<timestamp>/<case>/run-1`. This works for
   model-as-experiment publication but fragments one multi-target invocation and
   lets provider names with slashes become path hierarchy.
 - Vercel `next-evals-oss` uses one experiment file per model and `--agents-md`
   variant, then pairs variants during export. AgentV should allow that style by
-  experiment naming for published baselines, but not require it for ordinary
-  multi-target runs.
+  experiment naming for published baselines, but AgentV remains a superset of
+  Vercel-style naming and must not hard-code Next or Vercel semantics for
+  ordinary multi-target runs.
 - Margin Evals writes one output run directory with result manifests and
   instance artifacts, without an AgentV-style experiment bucket. That validates
   manifest-first storage, but dropping AgentV's experiment bucket is a larger
@@ -139,18 +162,19 @@ Reference alternatives considered:
 ## Consequences
 
 - A direct run such as `agentv eval evals/a.eval.yaml evals/b.eval.yaml`
-  produces one timestamped bundle unless the user explicitly runs separate CLI
-  commands.
+  produces one timestamped invocation directory unless the user explicitly runs
+  separate CLI commands.
 - The default no-config path is stable:
   `.agentv/results/default/<timestamp>/`.
 - Renaming a suite or display name does not move prior results or change
   Dashboard routing identity.
 - Multiple eval files can share the same `test_id` and suite display name as
   long as their `eval_path` values differ.
-- Import, rerun, Dashboard, comparison, and export tools can load a run from
-  `index.jsonl` without needing source checkout conventions.
-- Multi-target and variant runs do not need to become multiple experiments just
-  to avoid sidecar collisions.
+- Import, rerun, Dashboard, comparison, and export tools can load runs by
+  discovering nested `index.jsonl` manifests without needing source checkout
+  conventions or folder-name semantics.
+- Multi-target and variant runs do not need to become multiple experiments or
+  separate timestamped invocations just to avoid sidecar collisions.
 - New sidecar paths may not resemble the case hierarchy, which is acceptable
   because `index.jsonl` is the contract for discovery and display.
 

From e2867cc0445645a0c67c494147f08578d9636d9f Mon Sep 17 00:00:00 2001
From: Christopher Tso <christso@gmail.com>
Date: Mon, 29 Jun 2026 10:51:33 +0200
Subject: [PATCH 3/3] docs(adr): clarify suite row-id role

---
 .../0009-eval-path-result-identity-and-default-experiment.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/docs/adr/0009-eval-path-result-identity-and-default-experiment.md b/docs/adr/0009-eval-path-result-identity-and-default-experiment.md
index c77baa0a8..9ce000bbb 100644
--- a/docs/adr/0009-eval-path-result-identity-and-default-experiment.md
+++ b/docs/adr/0009-eval-path-result-identity-and-default-experiment.md
@@ -97,8 +97,9 @@ They should also display `test_id`, `target`, and `variant` when present so
 users can distinguish rows with overlapping test IDs.
 
 `suite` and `name` are display metadata. They may help humans group or label
-results, but they must not drive storage, routing, Dashboard detail selection,
-rerun lookup, import identity, or artifact discovery.
+results, and `suite` may participate in opaque row-id collision avoidance, but
+neither field may drive visible storage hierarchy, semantic routing, Dashboard
+detail selection, rerun lookup, import identity, or artifact discovery.
 
 `index.jsonl` is authoritative for all bundle-relative artifact paths. Per-row
 directories are exposed with `result_dir`. Sidecar paths such as `task_dir`,