EntityProcess · christso · Jun 29, 2026 · Jun 28, 2026 · Jun 29, 2026 · Jun 29, 2026
diff --git a/docs/adr/0006-separate-experiments-from-eval-definitions.md b/docs/adr/0006-separate-experiments-from-eval-definitions.md
@@ -31,7 +31,9 @@ The final design keeps the product boundary smaller:
 - `eval.yaml` is the only runnable authoring artifact.
 - `experiment:` is an inline run-time block inside `eval.yaml`.
 - `tests:` is the composition, import, and selection surface.
-- result bundles are written under `.agentv/results/<eval-name>/<timestamp>/`.
+- result invocations are written under
+  `.agentv/results/<experiment>/<timestamp>/`, with target/variant bundles below
+  the timestamp.
 - A directory named `experiments/` may be used as a user-owned repo convention
   for wrapper eval YAML files, but it does not create a separate experiment
   artifact type or schema-significant path.
@@ -371,28 +373,44 @@ This is the motivating distinction:
 
 ## Result Layout
 
-The canonical writer path is:
+ADR 0009 is the source of truth for result experiment bucket precedence,
+target/variant bundle layout, row identity, and sidecar path allocation. The
+canonical invocation directory is:
 
 ```text
-.agentv/results/<eval-name>/<timestamp>/...
+.agentv/results/<experiment>/<timestamp>/...
 ```
 
 There is no `.agentv/results/runs` segment in canonical writer output. There is
-also no default nested suite segment when the result group is already the same
-eval suite being run directly.
+also no schema-significant suite or imported-suite directory segment.
 
-If a wrapper eval imports another suite with `type: suite`, test artifacts from
-that imported suite are nested under the imported suite identity:
+Within the timestamp, new writers fan out into a target and optional variant
+bundle. Each target/variant bundle has its own `index.jsonl` and `summary.json`:
 
 ```text
-.agentv/results/<wrapper-eval-name>/<timestamp>/<imported-suite-name>/<test-id>/...
+.agentv/results/<experiment>/<timestamp>/<target>/<variant?>/
+  index.jsonl
+  summary.json
+  <row_id>/run-1/
+  <row_id>/run-2/
 ```
 
-The suite segment is required for imported suites because wrapper evals can
-compose many suites with overlapping test IDs, and the directory tree should
-remain inspectable without reading every manifest row. Test artifacts from tests
-owned directly by the wrapper eval can still live directly under `<test-id>`.
-All cases should also retain source suite metadata in manifests and index rows.
+The target/variant folder split is for storage isolation and manual browsing
+only. Dashboard, import, rerun, comparison, and export readers discover nested
+`index.jsonl` files, then use bundle summary and row metadata for `target`,
+`variant`, `eval_path`, `suite`, and `test_id` semantics. Legacy timestamp-level
+bundles with `index.jsonl` directly under
+`.agentv/results/<experiment>/<timestamp>/` remain readable.
+
+Inside each target/variant bundle, `index.jsonl` is authoritative for row
+identity and all bundle-relative sidecar paths. `row_id` directories are stable,
+filesystem-safe allocations such as `<safe_test_id>--<short_hash>`, and repeated
+runs live under `run-1`, `run-2`, and so on. The hash input includes the source
+eval identity or `eval_path`, suite label, `test_id`, target, and variant. This
+keeps rows that share a `test_id` but differ in source eval, suite, target, or
+variant from overwriting each other, without making the path hierarchy a
+semantic contract. There is no required `rows/` parent directory. Source suite
+metadata still belongs in manifests and index rows.
 
 The result namespace remains `experiment` in artifacts and Dashboard. AgentV
 should not introduce a separate authored `run_group` field. For better DX,

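The `row_id` allocation described in the Result Layout hunk above can be sketched as follows. This is a minimal illustration, not AgentV's implementation: the function name `allocate_row_id`, the SHA-256 digest, the 8-character hash length, and the sanitization rule are all assumptions. The ADR text only fixes the `<safe_test_id>--<short_hash>` shape and the fields that feed the hash.

```python
import hashlib
import json
import re
from typing import Optional


def allocate_row_id(
    eval_path: str,
    suite: Optional[str],
    test_id: str,
    target: str,
    variant: Optional[str] = None,
    hash_len: int = 8,  # assumed length; the ADR only asks for a "short" hash
) -> str:
    """Return a filesystem-safe directory name of the form <safe_test_id>--<short_hash>."""
    # Hypothetical sanitization rule: anything outside [A-Za-z0-9._-] becomes "-".
    safe_test_id = re.sub(r"[^A-Za-z0-9._-]+", "-", test_id).strip(".-") or "row"
    # Encode the fields as a JSON list so field boundaries stay unambiguous
    # (("a", "bc") and ("ab", "c") must not hash to the same value).
    payload = json.dumps([eval_path, suite, test_id, target, variant], ensure_ascii=False)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"{safe_test_id}--{digest[:hash_len]}"
```

Under these assumptions, two rows with the same `test_id` and suite label but different `eval_path` values receive different hashes, so their sidecar directories do not collide. Readers should still take `result_dir` from `index.jsonl` rather than recomputing this value.
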
diff --git a/docs/adr/0009-eval-path-result-identity-and-default-experiment.md b/docs/adr/0009-eval-path-result-identity-and-default-experiment.md
@@ -14,31 +14,64 @@ composition.
 ## Context
 
 AgentV needs one simple result identity contract that works for direct eval
-runs, imported evals, repeated attempts, Dashboard inspection, and downstream
-tools that consume portable run bundles.
+runs, imported evals, repeated runs, Dashboard inspection, and downstream tools
+that consume portable run bundles.
 
 The previous same-week direction kept `eval.yaml` as the authored experiment
 spec, but it still let result buckets and per-case paths be inferred from eval
 or suite names. That creates unstable routing when a single CLI invocation runs
 multiple eval files, imports suites with overlapping case IDs, or changes
 display metadata without changing the task under evaluation.
 
+Follow-up dogfood in bead `av-770` found a concrete bug in that direction:
+multi-target AgentV runs stored different targets for the same `test_id` under
+the same `<case>/run-1` sidecar directory. The second target overwrote the
+first target's output, grading, timing, metrics, and case summary artifacts.
+Related research in beads `av-74h` and `av-e49` compared Vercel `agent-eval`,
+Vercel `next-evals-oss`, and Margin Evals. Those systems confirm that
+frameworks can either encode model/variant in experiment names or keep run
+manifests as the truth, but none requires AgentV to derive artifact paths from
+suite names.
+
 The final contract keeps authoring and storage separate:
 
 - `eval.yaml` remains the authored experiment spec.
-- a CLI invocation produces one timestamped run bundle;
+- a CLI invocation produces one timestamped invocation directory;
+- each target and optional variant in that invocation gets an isolated result
+  bundle under the timestamp;
 - per-row source identity is stored in `index.jsonl`;
 - `suite` and `name` remain display metadata only;
 - path discovery comes from the run manifest, not from folder conventions.
+- per-row sidecar directories are stable storage allocations, not semantic
+  routing keys.
 
 ## Decision
 
-One AgentV CLI invocation writes one run bundle under:
+One AgentV CLI invocation writes one timestamped invocation directory under:
 
 ```text
 .agentv/results/<experiment>/<timestamp>/
 ```
 
+`experiment` remains the campaign namespace. The timestamp is the
+invocation/batch folder for that CLI run. Within the timestamp, new writers fan
+out into one bundle per target and optional variant:
+
+```text
+.agentv/results/<experiment>/<timestamp>/<target>/<variant?>/
+  index.jsonl
+  summary.json
+  <row_id>/run-1/
+  <row_id>/run-2/
+```
+
+The `<target>/<variant?>` folder split is only storage isolation and manual
+browsing structure. It is not the semantic source for target or variant. Readers
+discover nested `index.jsonl` files, then use loaded summary and row metadata for
+`target`, `variant`, `eval_path`, `suite`, and `test_id` semantics. Legacy
+timestamp-level bundles that keep `index.jsonl` and `summary.json` directly
+under `.agentv/results/<experiment>/<timestamp>/` remain readable.
+
 The result experiment bucket is selected in this order:
 
 1. the explicit CLI `--experiment` value;
@@ -55,42 +88,96 @@ suite names, numbers of input eval files, or multi-eval wrapper shapes.
 Each row in `index.jsonl` is identified by:
 
 ```text
-eval_path + test_id + target
+eval_path + test_id + target + variant
 ```
 
 `eval_path` is the source eval file path relative to the repo root or run
 source root. Dashboard and other readers should display this value as `Eval`.
-They should also display `test_id` and `target` so users can distinguish rows
-with overlapping test IDs.
+They should also display `test_id`, `target`, and `variant` when present so
+users can distinguish rows with overlapping test IDs.
 
 `suite` and `name` are display metadata. They may help humans group or label
-results, but they must not drive storage, routing, Dashboard detail selection,
-rerun lookup, import identity, or artifact discovery.
+results, and `suite` may participate in opaque row-id collision avoidance, but
+they must not drive visible storage hierarchy, semantic routing, Dashboard
+detail selection, rerun lookup, import identity, or artifact discovery.
 
-`index.jsonl` is authoritative for all run-relative artifact paths. Per-row
+`index.jsonl` is authoritative for all bundle-relative artifact paths. Per-row
 directories are exposed with `result_dir`. Sidecar paths such as `task_dir`,
 `summary_path`, `grading_path`, `metrics_path`, `transcript_path`,
 `targets_path`, `files_path`, and `graders_path` are explicit manifest fields.
 Consumers must use these fields instead of reconstructing paths from
-`suite`, `name`, `test_id`, or `target`.
+`suite`, `name`, `test_id`, `target`, `variant`, or target/variant folder names.
+
+`result_dir` is an opaque bundle-local allocation. For newly written artifacts,
+the preferred allocation is a deterministic row directory inside the
+target/variant bundle:
+
+```text
+.agentv/results/<experiment>/<timestamp>/<target>/<variant?>/
+  index.jsonl
+  summary.json
+  <row_id>/run-1/
+  <row_id>/run-2/
+```
+
+There is no required `rows/` parent directory. `row_id` should be stable,
+filesystem-safe, compact, and readable enough for humans to scan, for example:
+
+```text
+<safe_test_id>--<short_hash>
+```
 
-`result_dir` is an opaque run-local allocation. It should stay readable when
-that does not compromise uniqueness, but implementations may suffix or allocate
-otherwise to avoid collisions. Its value is not the public identity of the row.
+The visible `test_id` prefix is only a convenience. The hash input must include
+the collision-prone row fields available at write time: `eval_path` or source
+eval identity, `suite` label, `test_id`, `target`, and `variant`. `eval_path`
+or equivalent source identity is what prevents duplicate suite names from
+colliding; the suite label alone is never a uniqueness boundary. If future row
+identity gains another axis, that axis must be included in the hash before it
+can affect sidecar allocation.
+
+This row-id allocation is intentionally simpler than conditional path
+disambiguation. It avoids special cases for same `test_id` across suites,
+duplicate suite labels, multi-target runs, and target variants while keeping a
+multi-target CLI run under one timestamped invocation directory. Existing run
+bundles remain readable because `index.jsonl` already records explicit artifact
+paths; any consumer that infers `<case>/run-1` paths or semantic target
+information from folders instead of following `index.jsonl` is depending on an
+implementation detail and should be fixed.
+
+Reference alternatives considered:
+
+- Vercel `agent-eval` expands model arrays into experiment paths such as
+  `<config>/<model>/<timestamp>/<case>/run-1`. This works for
+  model-as-experiment publication but fragments one multi-target invocation and
+  lets provider names with slashes become path hierarchy.
+- Vercel `next-evals-oss` uses one experiment file per model and `--agents-md`
+  variant, then pairs variants during export. AgentV should allow that style by
+  experiment naming for published baselines, but AgentV remains a superset of
+  Vercel-style naming and must not hard-code Next or Vercel semantics for
+  ordinary multi-target runs.
+- Margin Evals writes one output run directory with result manifests and
+  instance artifacts, without an AgentV-style experiment bucket. That validates
+  manifest-first storage, but dropping AgentV's experiment bucket is a larger
+  semantic change than this bug fix needs.
 
 ## Consequences
 
 - A direct run such as `agentv eval evals/a.eval.yaml evals/b.eval.yaml`
-  produces one timestamped bundle unless the user explicitly runs separate CLI
-  commands.
+  produces one timestamped invocation directory unless the user explicitly runs
+  separate CLI commands.
 - The default no-config path is stable:
   `.agentv/results/default/<timestamp>/`.
 - Renaming a suite or display name does not move prior results or change
   Dashboard routing identity.
 - Multiple eval files can share the same `test_id` and suite display name as
   long as their `eval_path` values differ.
-- Import, rerun, Dashboard, comparison, and export tools can load a run from
-  `index.jsonl` without needing source checkout conventions.
+- Import, rerun, Dashboard, comparison, and export tools can load runs by
+  discovering nested `index.jsonl` manifests without needing source checkout
+  conventions or folder-name semantics.
+- Multi-target and variant runs do not need to become multiple experiments or
+  separate timestamped invocations just to avoid sidecar collisions.
+- New sidecar paths might not resemble the case hierarchy, which is acceptable
+  because `index.jsonl` is the contract for discovery and display.
 
 ## Non-Goals
 
@@ -99,4 +186,5 @@ otherwise to avoid collisions. Its value is not the public identity of the row.
 - Hashing eval paths into default experiment names.
 - Creating automatic `multi-eval` experiment names.
 - Making `result_dir` a semantic folder contract.
+- Adding a `rows/` directory segment without a concrete implementation need.
 - Removing compatibility readers for older run bundles in this ADR.
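
The manifest-first reader behavior described in ADR 0009 can be sketched as follows. This is a hedged illustration only: the helper names (`discover_bundles`, `iter_rows`, `row_key`) and the `_bundle_dir` / `_result_path` keys are hypothetical, and the sketch assumes each `index.jsonl` line is a JSON object carrying `eval_path`, `test_id`, `target`, optional `variant`, and a bundle-relative `result_dir`. It also assumes no sidecar file inside a row directory is itself named `index.jsonl`.

```python
import json
from pathlib import Path
from typing import Iterator


def discover_bundles(invocation_dir: Path) -> list[Path]:
    """Find every bundle directory under one timestamped invocation directory.

    rglob also matches an index.jsonl directly under invocation_dir, which
    covers legacy timestamp-level bundles.
    """
    return sorted(p.parent for p in invocation_dir.rglob("index.jsonl"))


def iter_rows(invocation_dir: Path) -> Iterator[dict]:
    """Yield manifest rows, resolving result_dir against the owning bundle."""
    for bundle_dir in discover_bundles(invocation_dir):
        with (bundle_dir / "index.jsonl").open(encoding="utf-8") as fh:
            for line in fh:
                if not line.strip():
                    continue
                row = json.loads(line)
                row["_bundle_dir"] = bundle_dir  # hypothetical helper key
                if "result_dir" in row:
                    # Paths come from the manifest, never from folder names.
                    row["_result_path"] = bundle_dir / row["result_dir"]
                yield row


def row_key(row: dict) -> tuple:
    """Row identity per the ADR: eval_path + test_id + target + variant."""
    return (row["eval_path"], row["test_id"], row["target"], row.get("variant"))
```

Note that `target` and `variant` are read from row metadata here. The `<target>/<variant?>` folder names are used only to locate `index.jsonl`, which matches the ADR's statement that the folder split is not a semantic source.
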