From 8890e3622b7a3eceafe1b8698ddce2907369d4ce Mon Sep 17 00:00:00 2001 From: Christopher Tso Date: Mon, 29 Jun 2026 01:14:30 +0200 Subject: [PATCH 1/3] docs(adr): clarify result row sidecar identity --- ...arate-experiments-from-eval-definitions.md | 27 ++++--- ...-result-identity-and-default-experiment.md | 75 +++++++++++++++++-- 2 files changed, 84 insertions(+), 18 deletions(-) diff --git a/docs/adr/0006-separate-experiments-from-eval-definitions.md b/docs/adr/0006-separate-experiments-from-eval-definitions.md index e4ad436ad..c5b176469 100644 --- a/docs/adr/0006-separate-experiments-from-eval-definitions.md +++ b/docs/adr/0006-separate-experiments-from-eval-definitions.md @@ -371,28 +371,31 @@ This is the motivating distinction: ## Result Layout -The canonical writer path is: +ADR 0009 is the source of truth for result experiment bucket precedence, row +identity, and sidecar path allocation. The canonical run bundle root is: ```text -.agentv/results///... +.agentv/results///... ``` There is no `.agentv/results/runs` segment in canonical writer output. There is -also no default nested suite segment when the result group is already the same -eval suite being run directly. +also no schema-significant suite or imported-suite directory segment. -If a wrapper eval imports another suite with `type: suite`, test artifacts from -that imported suite are nested under the imported suite identity: +Within the timestamp, `index.jsonl` is authoritative for row identity and all +run-relative sidecar paths. New writers should allocate deterministic row +directories directly under the timestamp: ```text -.agentv/results/////... +.agentv/results////run-1/... ``` -The suite segment is required for imported suites because wrapper evals can -compose many suites with overlapping test IDs, and the directory tree should -remain inspectable without reading every manifest row. Test artifacts from tests -owned directly by the wrapper eval can still live directly under ``. -All cases should also retain source suite metadata in manifests and index rows. +`row_id` is a stable, filesystem-safe allocation such as +`--`. The hash input includes the source eval identity +or `eval_path`, suite label, `test_id`, target, and variant. This keeps +same-`test_id` rows from different suites, duplicate suite labels, targets, and +variants from overwriting each other without making path hierarchy a semantic +contract. There is no required `rows/` parent directory. Source suite metadata +still belongs in manifests and index rows. The result namespace remains `experiment` in artifacts and Dashboard. AgentV should not introduce a separate authored `run_group` field. For better DX, diff --git a/docs/adr/0009-eval-path-result-identity-and-default-experiment.md b/docs/adr/0009-eval-path-result-identity-and-default-experiment.md index c48fb0702..f5deb1fb7 100644 --- a/docs/adr/0009-eval-path-result-identity-and-default-experiment.md +++ b/docs/adr/0009-eval-path-result-identity-and-default-experiment.md @@ -23,6 +23,16 @@ or suite names. That creates unstable routing when a single CLI invocation runs multiple eval files, imports suites with overlapping case IDs, or changes display metadata without changing the task under evaluation. +Follow-up dogfood in bead `av-770` found a concrete bug in that direction: +multi-target AgentV runs stored different targets for the same `test_id` under +the same `/run-N` sidecar directory. The second target overwrote the +first target's output, grading, timing, metrics, and case summary artifacts. +Related research in beads `av-74h` and `av-e49` compared Vercel `agent-eval`, +Vercel `next-evals-oss`, and Margin Evals. Those systems confirm that +frameworks can either encode model/variant in experiment names or keep run +manifests as the truth, but none requires AgentV to derive artifact paths from +suite names. + The final contract keeps authoring and storage separate: - `eval.yaml` remains the authored experiment spec. @@ -30,6 +40,8 @@ The final contract keeps authoring and storage separate: - per-row source identity is stored in `index.jsonl`; - `suite` and `name` remain display metadata only; - path discovery comes from the run manifest, not from folder conventions. +- per-row sidecar directories are stable storage allocations, not semantic + routing keys. ## Decision @@ -55,13 +67,13 @@ suite names, numbers of input eval files, or multi-eval wrapper shapes. Each row in `index.jsonl` is identified by: ```text -eval_path + test_id + target +eval_path + test_id + target + variant ``` `eval_path` is the source eval file path relative to the repo root or run source root. Dashboard and other readers should display this value as `Eval`. -They should also display `test_id` and `target` so users can distinguish rows -with overlapping test IDs. +They should also display `test_id`, `target`, and `variant` when present so +users can distinguish rows with overlapping test IDs. `suite` and `name` are display metadata. They may help humans group or label results, but they must not drive storage, routing, Dashboard detail selection, @@ -74,9 +86,55 @@ directories are exposed with `result_dir`. Sidecar paths such as `task_dir`, Consumers must use these fields instead of reconstructing paths from `suite`, `name`, `test_id`, or `target`. -`result_dir` is an opaque run-local allocation. It should stay readable when -that does not compromise uniqueness, but implementations may suffix or allocate -otherwise to avoid collisions. Its value is not the public identity of the row. +`result_dir` is an opaque run-local allocation. For newly written artifacts, the +preferred allocation is a deterministic row directory directly under the +timestamp: + +```text +.agentv/results/// + index.jsonl + summary.json + /run-1/ + /run-2/ +``` + +There is no required `rows/` parent directory. `row_id` should be stable, +filesystem-safe, compact, and readable enough for humans to scan, for example: + +```text +-- +``` + +The visible `test_id` prefix is only a convenience. The hash input must include +the collision-prone row fields available at write time: `eval_path` or source +eval identity, `suite` label, `test_id`, `target`, and `variant`. `eval_path` +or equivalent source identity is what prevents duplicate suite names from +colliding; the suite label alone is never a uniqueness boundary. If future row +identity gains another axis, that axis must be included in the hash before it +can affect sidecar allocation. + +This row-id allocation is intentionally simpler than conditional path +disambiguation. It avoids special cases for same `test_id` across suites, +duplicate suite labels, multi-target runs, and target variants. Existing run +bundles remain readable because `index.jsonl` already records explicit +run-relative paths; any consumer that infers `/run-N` paths instead of +following `index.jsonl` is depending on an implementation detail and should be +fixed. + +Reference alternatives considered: + +- Vercel `agent-eval` expands model arrays into experiment paths such as + `////run-N`. This works for + model-as-experiment publication but fragments one multi-target invocation and + lets provider names with slashes become path hierarchy. +- Vercel `next-evals-oss` uses one experiment file per model and `--agents-md` + variant, then pairs variants during export. AgentV should allow that style by + experiment naming for published baselines, but not require it for ordinary + multi-target runs. +- Margin Evals writes one output run directory with result manifests and + instance artifacts, without an AgentV-style experiment bucket. That validates + manifest-first storage, but dropping AgentV's experiment bucket is a larger + semantic change than this bug fix needs. ## Consequences @@ -91,6 +149,10 @@ otherwise to avoid collisions. Its value is not the public identity of the row. long as their `eval_path` values differ. - Import, rerun, Dashboard, comparison, and export tools can load a run from `index.jsonl` without needing source checkout conventions. +- Multi-target and variant runs do not need to become multiple experiments just + to avoid sidecar collisions. +- New sidecar paths may not resemble the case hierarchy, which is acceptable + because `index.jsonl` is the contract for discovery and display. ## Non-Goals @@ -99,4 +161,5 @@ otherwise to avoid collisions. Its value is not the public identity of the row. - Hashing eval paths into default experiment names. - Creating automatic `multi-eval` experiment names. - Making `result_dir` a semantic folder contract. +- Adding a `rows/` directory segment without a concrete implementation need. - Removing compatibility readers for older run bundles in this ADR. From 0456fa609b66c0f372b3b995150992301a7c9cc2 Mon Sep 17 00:00:00 2001 From: Christopher Tso Date: Mon, 29 Jun 2026 08:43:30 +0200 Subject: [PATCH 2/3] docs(adr): align row identity with target bundles --- ...arate-experiments-from-eval-definitions.md | 43 +++++++---- ...-result-identity-and-default-experiment.md | 74 ++++++++++++------- 2 files changed, 78 insertions(+), 39 deletions(-) diff --git a/docs/adr/0006-separate-experiments-from-eval-definitions.md b/docs/adr/0006-separate-experiments-from-eval-definitions.md index c5b176469..0cee8caea 100644 --- a/docs/adr/0006-separate-experiments-from-eval-definitions.md +++ b/docs/adr/0006-separate-experiments-from-eval-definitions.md @@ -31,7 +31,9 @@ The final design keeps the product boundary smaller: - `eval.yaml` is the only runnable authoring artifact. - `experiment:` is an inline run-time block inside `eval.yaml`. - `tests:` is the composition, import, and selection surface. -- result bundles are written under `.agentv/results///`. +- result invocations are written under + `.agentv/results///`, with target/variant bundles below + the timestamp. - A directory named `experiments/` may be used as a user-owned repo convention for wrapper eval YAML files, but it does not create a separate experiment artifact type or schema-significant path. @@ -371,8 +373,9 @@ This is the motivating distinction: ## Result Layout -ADR 0009 is the source of truth for result experiment bucket precedence, row -identity, and sidecar path allocation. The canonical run bundle root is: +ADR 0009 is the source of truth for result experiment bucket precedence, +target/variant bundle layout, row identity, and sidecar path allocation. The +canonical invocation directory is: ```text .agentv/results///... @@ -381,21 +384,33 @@ identity, and sidecar path allocation. The canonical run bundle root is: There is no `.agentv/results/runs` segment in canonical writer output. There is also no schema-significant suite or imported-suite directory segment. -Within the timestamp, `index.jsonl` is authoritative for row identity and all -run-relative sidecar paths. New writers should allocate deterministic row -directories directly under the timestamp: +Within the timestamp, new writers fan out into a target and optional variant +bundle. Each target/variant bundle has its own `index.jsonl` and `summary.json`: ```text -.agentv/results////run-1/... +.agentv/results///// + index.jsonl + summary.json + /run-1/ + /run-2/ ``` -`row_id` is a stable, filesystem-safe allocation such as -`--`. The hash input includes the source eval identity -or `eval_path`, suite label, `test_id`, target, and variant. This keeps -same-`test_id` rows from different suites, duplicate suite labels, targets, and -variants from overwriting each other without making path hierarchy a semantic -contract. There is no required `rows/` parent directory. Source suite metadata -still belongs in manifests and index rows. +The target/variant folder split is for storage isolation and manual browsing +only. Dashboard, import, rerun, comparison, and export readers discover nested +`index.jsonl` files, then use bundle summary and row metadata for `target`, +`variant`, `eval_path`, `suite`, and `test_id` semantics. Legacy timestamp-level +bundles with `index.jsonl` directly under +`.agentv/results///` remain readable. + +Inside each target/variant bundle, `index.jsonl` is authoritative for row +identity and all bundle-relative sidecar paths. `row_id` directories are stable, +filesystem-safe allocations such as `--`, and repeated +runs live under `run-1`, `run-2`, and so on. The hash input includes the source +eval identity or `eval_path`, suite label, `test_id`, target, and variant. This +keeps same-`test_id` rows from different suites, duplicate suite labels, targets, +and variants from overwriting each other without making path hierarchy a +semantic contract. There is no required `rows/` parent directory. Source suite +metadata still belongs in manifests and index rows. The result namespace remains `experiment` in artifacts and Dashboard. AgentV should not introduce a separate authored `run_group` field. For better DX, diff --git a/docs/adr/0009-eval-path-result-identity-and-default-experiment.md b/docs/adr/0009-eval-path-result-identity-and-default-experiment.md index f5deb1fb7..c77baa0a8 100644 --- a/docs/adr/0009-eval-path-result-identity-and-default-experiment.md +++ b/docs/adr/0009-eval-path-result-identity-and-default-experiment.md @@ -14,8 +14,8 @@ composition. ## Context AgentV needs one simple result identity contract that works for direct eval -runs, imported evals, repeated attempts, Dashboard inspection, and downstream -tools that consume portable run bundles. +runs, imported evals, repeated runs, Dashboard inspection, and downstream tools +that consume portable run bundles. The previous same-week direction kept `eval.yaml` as the authored experiment spec, but it still let result buckets and per-case paths be inferred from eval @@ -25,7 +25,7 @@ display metadata without changing the task under evaluation. Follow-up dogfood in bead `av-770` found a concrete bug in that direction: multi-target AgentV runs stored different targets for the same `test_id` under -the same `/run-N` sidecar directory. The second target overwrote the +the same `/run-1` sidecar directory. The second target overwrote the first target's output, grading, timing, metrics, and case summary artifacts. Related research in beads `av-74h` and `av-e49` compared Vercel `agent-eval`, Vercel `next-evals-oss`, and Margin Evals. Those systems confirm that @@ -36,7 +36,9 @@ suite names. The final contract keeps authoring and storage separate: - `eval.yaml` remains the authored experiment spec. -- a CLI invocation produces one timestamped run bundle; +- a CLI invocation produces one timestamped invocation directory; +- each target and optional variant in that invocation gets an isolated result + bundle under the timestamp; - per-row source identity is stored in `index.jsonl`; - `suite` and `name` remain display metadata only; - path discovery comes from the run manifest, not from folder conventions. @@ -45,12 +47,31 @@ The final contract keeps authoring and storage separate: ## Decision -One AgentV CLI invocation writes one run bundle under: +One AgentV CLI invocation writes one timestamped invocation directory under: ```text .agentv/results/// ``` +`experiment` remains the campaign namespace. The timestamp is the +invocation/batch folder for that CLI run. Within the timestamp, new writers fan +out into one bundle per target and optional variant: + +```text +.agentv/results///// + index.jsonl + summary.json + /run-1/ + /run-2/ +``` + +The `/` folder split is only storage isolation and manual +browsing structure. It is not the semantic source for target or variant. Readers +discover nested `index.jsonl` files, then use loaded summary and row metadata for +`target`, `variant`, `eval_path`, `suite`, and `test_id` semantics. Legacy +timestamp-level bundles that keep `index.jsonl` and `summary.json` directly +under `.agentv/results///` remain readable. + The result experiment bucket is selected in this order: 1. the explicit CLI `--experiment` value; @@ -79,19 +100,19 @@ users can distinguish rows with overlapping test IDs. results, but they must not drive storage, routing, Dashboard detail selection, rerun lookup, import identity, or artifact discovery. -`index.jsonl` is authoritative for all run-relative artifact paths. Per-row +`index.jsonl` is authoritative for all bundle-relative artifact paths. Per-row directories are exposed with `result_dir`. Sidecar paths such as `task_dir`, `summary_path`, `grading_path`, `metrics_path`, `transcript_path`, `targets_path`, `files_path`, and `graders_path` are explicit manifest fields. Consumers must use these fields instead of reconstructing paths from -`suite`, `name`, `test_id`, or `target`. +`suite`, `name`, `test_id`, `target`, `variant`, or target/variant folder names. -`result_dir` is an opaque run-local allocation. For newly written artifacts, the -preferred allocation is a deterministic row directory directly under the -timestamp: +`result_dir` is an opaque bundle-local allocation. For newly written artifacts, +the preferred allocation is a deterministic row directory inside the +target/variant bundle: ```text -.agentv/results/// +.agentv/results///// index.jsonl summary.json /run-1/ @@ -115,22 +136,24 @@ can affect sidecar allocation. This row-id allocation is intentionally simpler than conditional path disambiguation. It avoids special cases for same `test_id` across suites, -duplicate suite labels, multi-target runs, and target variants. Existing run -bundles remain readable because `index.jsonl` already records explicit -run-relative paths; any consumer that infers `/run-N` paths instead of -following `index.jsonl` is depending on an implementation detail and should be -fixed. +duplicate suite labels, multi-target runs, and target variants while keeping the +multi-target CLI under one timestamped invocation directory. Existing run +bundles remain readable because `index.jsonl` already records explicit artifact +paths; any consumer that infers `/run-1` paths or semantic target +information from folders instead of following `index.jsonl` is depending on an +implementation detail and should be fixed. Reference alternatives considered: - Vercel `agent-eval` expands model arrays into experiment paths such as - `////run-N`. This works for + `////run-1`. This works for model-as-experiment publication but fragments one multi-target invocation and lets provider names with slashes become path hierarchy. - Vercel `next-evals-oss` uses one experiment file per model and `--agents-md` variant, then pairs variants during export. AgentV should allow that style by - experiment naming for published baselines, but not require it for ordinary - multi-target runs. + experiment naming for published baselines, but AgentV remains a superset of + Vercel-style naming and must not hard-code Next or Vercel semantics for + ordinary multi-target runs. - Margin Evals writes one output run directory with result manifests and instance artifacts, without an AgentV-style experiment bucket. That validates manifest-first storage, but dropping AgentV's experiment bucket is a larger @@ -139,18 +162,19 @@ Reference alternatives considered: ## Consequences - A direct run such as `agentv eval evals/a.eval.yaml evals/b.eval.yaml` - produces one timestamped bundle unless the user explicitly runs separate CLI - commands. + produces one timestamped invocation directory unless the user explicitly runs + separate CLI commands. - The default no-config path is stable: `.agentv/results/default//`. - Renaming a suite or display name does not move prior results or change Dashboard routing identity. - Multiple eval files can share the same `test_id` and suite display name as long as their `eval_path` values differ. -- Import, rerun, Dashboard, comparison, and export tools can load a run from - `index.jsonl` without needing source checkout conventions. -- Multi-target and variant runs do not need to become multiple experiments just - to avoid sidecar collisions. +- Import, rerun, Dashboard, comparison, and export tools can load runs by + discovering nested `index.jsonl` manifests without needing source checkout + conventions or folder-name semantics. +- Multi-target and variant runs do not need to become multiple experiments or + separate timestamped invocations just to avoid sidecar collisions. - New sidecar paths may not resemble the case hierarchy, which is acceptable because `index.jsonl` is the contract for discovery and display. From e2867cc0445645a0c67c494147f08578d9636d9f Mon Sep 17 00:00:00 2001 From: Christopher Tso Date: Mon, 29 Jun 2026 10:51:33 +0200 Subject: [PATCH 3/3] docs(adr): clarify suite row-id role --- .../0009-eval-path-result-identity-and-default-experiment.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/docs/adr/0009-eval-path-result-identity-and-default-experiment.md b/docs/adr/0009-eval-path-result-identity-and-default-experiment.md index c77baa0a8..9ce000bbb 100644 --- a/docs/adr/0009-eval-path-result-identity-and-default-experiment.md +++ b/docs/adr/0009-eval-path-result-identity-and-default-experiment.md @@ -97,8 +97,9 @@ They should also display `test_id`, `target`, and `variant` when present so users can distinguish rows with overlapping test IDs. `suite` and `name` are display metadata. They may help humans group or label -results, but they must not drive storage, routing, Dashboard detail selection, -rerun lookup, import identity, or artifact discovery. +results, and `suite` may participate in opaque row-id collision avoidance, but +they must not drive visible storage hierarchy, semantic routing, Dashboard +detail selection, rerun lookup, import identity, or artifact discovery. `index.jsonl` is authoritative for all bundle-relative artifact paths. Per-row directories are exposed with `result_dir`. Sidecar paths such as `task_dir`,