From a7d5f07d5af1c287ab80d9dd57b3ce077565939c Mon Sep 17 00:00:00 2001
From: Christopher Tso <christso@gmail.com>
Date: Sat, 27 Jun 2026 08:55:03 +0200
Subject: [PATCH 1/2] docs(results): codify eval result identity contract

---
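Reviewer note (not part of the commit): the bucket precedence this patch
documents can be sketched as below. This is a minimal illustration, not AgentV's
implementation. `resolve_experiment` and the `eval_config` dict (standing in for
a parsed `eval.yaml`) are hypothetical names.

```python
from typing import Optional

DEFAULT_EXPERIMENT = "default"


def resolve_experiment(cli_experiment: Optional[str], eval_config: dict) -> str:
    """Pick the result bucket (hypothetical helper, assumed dict-shaped config).

    Order follows the documented contract:
    1. explicit CLI --experiment value
    2. eval.yaml experiment.name
    3. "default"
    """
    if cli_experiment:
        return cli_experiment
    # Assumption: eval.yaml was parsed into a dict with an optional
    # "experiment" mapping that may carry a "name" key.
    experiment = eval_config.get("experiment") or {}
    name = experiment.get("name")
    if name:
        return name
    return DEFAULT_EXPERIMENT


def run_bundle_prefix(experiment: str) -> str:
    """Bucket directory; the timestamped run folder is created below it."""
    return f".agentv/results/{experiment}/"
```

Note that nothing here looks at filenames, suite names, or the number of eval
files, which matches the ADR's statement that default names are never derived
from those inputs.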
 CONCEPTS.md                                   |   8 +-
 .../docs/docs/evaluation/running-evals.mdx    |  28 +++--
 ...arate-experiments-from-eval-definitions.md |   4 +
 ...-result-identity-and-default-experiment.md | 102 ++++++++++++++++++
 4 files changed, 133 insertions(+), 9 deletions(-)
 create mode 100644 docs/adr/0009-eval-path-result-identity-and-default-experiment.md
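Reviewer note (not part of the commit): a consumer following the identity
contract might index rows as sketched below. The field names `eval_path`,
`test_id`, `target`, and `result_dir` come from this patch. The helper names
and the choice to keep a list per key (so repeated attempts do not overwrite
each other) are assumptions for illustration only.

```python
import json
from collections import defaultdict
from pathlib import Path
from typing import Dict, List, Tuple

# Source identity of a result row: (eval_path, test_id, target).
RowKey = Tuple[str, str, str]


def load_rows(run_dir: Path) -> Dict[RowKey, List[dict]]:
    """Index index.jsonl rows by source identity (hypothetical helper)."""
    rows: Dict[RowKey, List[dict]] = defaultdict(list)
    with (run_dir / "index.jsonl").open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            row = json.loads(line)
            # suite and name are display metadata and are deliberately
            # left out of the key.
            key = (row["eval_path"], row["test_id"], row["target"])
            rows[key].append(row)
    return dict(rows)


def result_dir_for(run_dir: Path, row: dict) -> Path:
    """Resolve a row's directory from the manifest field, never by inference."""
    return run_dir / row["result_dir"]
```

The lookup treats `result_dir` as an opaque value read from the manifest, so a
renamed suite or a suffixed directory name does not affect how rows are found.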
diff --git a/CONCEPTS.md b/CONCEPTS.md
index 3014bf601..bbeeefdff 100644
--- a/CONCEPTS.md
+++ b/CONCEPTS.md
@@ -16,9 +16,13 @@ Shared domain vocabulary for this project — entities, named processes, and sta
 
 **Experiment** — A committed run variant that selects how evals are executed: target or target matrix, setup, scripts, eval filters, repeat counts, timeouts, workers, budgets, and related run knobs. Experiments make A/B setup differences explicit while pointing at stable eval tasks.
 
-**Run manifest** — The root `index.jsonl` file in a run bundle. It is the dashboard and tooling loading contract for per-case result rows and artifact locations, including fields such as `artifact_dir`, `task_dir`, `summary_path`, and `grading_path`.
+**Run manifest** — The root `index.jsonl` file in a run bundle. It is the dashboard and tooling loading contract for per-case result rows and artifact locations, including fields such as `result_dir`, `task_dir`, `summary_path`, and `grading_path`.
 
-**Artifact sidecar** — A file beside or below a test-case artifact directory that provides evidence for a result, such as `summary.json`, `grading.json`, `result.json`, transcripts, logs, or outputs. Sidecars are evidence, not the primary discovery mechanism for a run.
+**Result source identity** — The stable source identity for a result row: repo-relative `eval_path`, `test_id`, and `target`. `suite` and `name` are display metadata, not storage or routing identity.
+
+**Result directory** — The `result_dir` field in an `index.jsonl` row. It is a run-local directory allocation for that row's sidecars and outputs. Consumers discover it from `index.jsonl` and must not infer it from suite names, display names, test IDs, or targets.
+
+**Artifact sidecar** — A file beside or below a result directory that provides evidence for a result, such as `summary.json`, `grading.json`, `result.json`, transcripts, logs, or outputs. Sidecars are evidence, not the primary discovery mechanism for a run.
 
 ## Evaluation Reliability
 
diff --git a/apps/web/src/content/docs/docs/evaluation/running-evals.mdx b/apps/web/src/content/docs/docs/evaluation/running-evals.mdx
index 92b577f9f..8a869a4ac 100644
--- a/apps/web/src/content/docs/docs/evaluation/running-evals.mdx
+++ b/apps/web/src/content/docs/docs/evaluation/running-evals.mdx
@@ -11,7 +11,13 @@ sidebar:
 agentv eval evals/my-eval.yaml
 ```
 
-Results are written to `.agentv/results/<experiment>/<timestamp>/index.jsonl`. When no experiment is defined, AgentV uses `.agentv/results/default/<timestamp>/index.jsonl`. Each line is a JSON object with one result per test case, and the run workspace also stores the manifest and related artifacts. Use this generated run folder as the portable audit surface: copy or sync the run directory, not a hand-authored parallel bundle.
+Results are written to `.agentv/results/<experiment>/<timestamp>/index.jsonl`.
+AgentV picks the experiment bucket from `--experiment`, then
+`eval.yaml` `experiment.name`, then `default`. Each CLI invocation writes one
+timestamped run bundle. Each line of `index.jsonl` is a JSON object holding the
+result for one test case, and the run workspace also stores the manifest and
+related artifacts. Use this generated run folder as the portable audit surface:
+copy or sync the run directory, not a hand-authored parallel bundle.
 
 Each `scores[]` entry includes per-grader timing:
 
@@ -49,14 +55,18 @@ agentv eval --target my-target evals/**/*.yaml
 
 ### Experiment Label
 
-Tag a pipeline run with an experiment name to track different conditions (e.g. with vs without skills):
+Tag a run with an experiment name to track different conditions (e.g. with vs without skills):
 
 ```bash
-agentv pipeline run evals/my-eval.yaml --experiment with_skills
-agentv pipeline run evals/my-eval.yaml --experiment without_skills
+agentv eval evals/my-eval.yaml --experiment with_skills
+agentv eval evals/my-eval.yaml --experiment without_skills
 ```
 
-The experiment label is written to `manifest.json` and propagated to each entry in `index.jsonl` by `pipeline bench`. The eval file stays the same across experiments — what changes is the environment. Dashboards can filter and compare results by experiment.
+The experiment label chooses the result bucket and is propagated to each entry
+in `index.jsonl`. CLI `--experiment` wins over `experiment.name` in the eval
+file. If neither is set, AgentV writes to the `default` bucket. The eval file
+stays the same across experiments; what changes is the runtime condition.
+Dashboards can filter and compare results by experiment.
 
 ### Run Specific Test
 
@@ -129,7 +139,7 @@ my-results/
 ```
 
 The `index.jsonl` row links to these generated paths with snake_case fields such
-as `artifact_dir`, `task_dir`, `eval_path`, `targets_path`, `files_path`, and
+as `result_dir`, `task_dir`, `eval_path`, `targets_path`, `files_path`, and
 `graders_path`. Treat those paths as relative to the run directory. When you need
 a portable artifact for audit, review, Dashboard inspection, or rerun workflows,
 share the generated run directory and its `index.jsonl` manifest. Source-side
@@ -137,6 +147,10 @@ case directories are still useful for organizing bulky prompts, fixtures, or
 tests while authoring an eval, but they are optional input organization rather
 than a separate artifact schema.
 
+Use repo-relative `eval_path`, `test_id`, and `target` as the source identity
+for a result row. `suite` and `name` are display metadata only; do not use them
+to infer storage paths or pick a Dashboard detail row.
+
 If the source eval uses the `PROMPT.md` fallback instead of inline `input`,
 AgentV records the generated task bundle metadata when source artifacts are
 available. It no longer emits a generated prompt sidecar for result rows.
@@ -443,7 +457,7 @@ See the [Import tool docs](/docs/tools/import/) for all providers and options.
 
 ## Transcript And Result Artifacts
 
-Each result row's `artifact_dir` is a case-local folder under the timestamped
+Each result row's `result_dir` is a case-local folder under the timestamped
 run bundle. It can include `transcript.jsonl`, `transcript-raw.jsonl`,
 `grading.json`, `timing.json`, `metrics.json`, and generated outputs under
 `outputs/`. The run root does not contain a mixed transcript artifact; use each
diff --git a/docs/adr/0006-separate-experiments-from-eval-definitions.md b/docs/adr/0006-separate-experiments-from-eval-definitions.md
index 752495b71..8e441d8bf 100644
--- a/docs/adr/0006-separate-experiments-from-eval-definitions.md
+++ b/docs/adr/0006-separate-experiments-from-eval-definitions.md
@@ -9,6 +9,10 @@ Accepted
 Supersedes: the 2026-06-23 proposal in this file to separate experiment files
 from eval definitions.
 
+Partially superseded by
+[ADR 0009](0009-eval-path-result-identity-and-default-experiment.md) for result
+experiment bucket precedence, result row identity, and run bundle path naming.
+
 ## Context
 
 AgentV needs a stable authoring contract for repo-native evals, run-time knobs,
diff --git a/docs/adr/0009-eval-path-result-identity-and-default-experiment.md b/docs/adr/0009-eval-path-result-identity-and-default-experiment.md
new file mode 100644
index 000000000..c48fb0702
--- /dev/null
+++ b/docs/adr/0009-eval-path-result-identity-and-default-experiment.md
@@ -0,0 +1,102 @@
+# 9. Use eval_path identity and the default result experiment
+
+Date: 2026-06-27
+
+## Status
+
+Accepted
+
+Supersedes: result naming and storage-routing portions of
+[ADR 0006](0006-separate-experiments-from-eval-definitions.md) that derive run
+bundle names or per-case artifact paths from eval names, suite names, or wrapper
+composition.
+
+## Context
+
+AgentV needs one simple result identity contract that works for direct eval
+runs, imported evals, repeated attempts, Dashboard inspection, and downstream
+tools that consume portable run bundles.
+
+The earlier direction from the same week kept `eval.yaml` as the authored
+experiment spec, but it still let result buckets and per-case paths be inferred
+from eval or suite names. That makes routing unstable when a single CLI
+invocation runs multiple eval files or imports suites with overlapping case
+IDs, or when display metadata changes while the task under evaluation does not.
+
+The final contract keeps authoring and storage separate:
+
+- `eval.yaml` remains the authored experiment spec;
+- a CLI invocation produces one timestamped run bundle;
+- per-row source identity is stored in `index.jsonl`;
+- `suite` and `name` remain display metadata only;
+- path discovery comes from the run manifest, not from folder conventions.
+
+## Decision
+
+One AgentV CLI invocation writes one run bundle under:
+
+```text
+.agentv/results/<experiment>/<timestamp>/
+```
+
+The result experiment bucket is selected in this order:
+
+1. the explicit CLI `--experiment` value;
+2. `eval.yaml` `experiment.name`;
+3. `default`.
+
+`default` is the canonical bucket when neither the CLI nor the eval file names
+an experiment. AgentV does not derive default experiment names from filenames,
+suite names, numbers of input eval files, or multi-eval wrapper shapes.
+
+`eval.yaml` stays the authored experiment spec. Do not introduce
+`experiment.yaml`, `experiments/default.yaml`, or `eval_root` for this pass.
+
+Each row in `index.jsonl` is identified by:
+
+```text
+eval_path + test_id + target
+```
+
+`eval_path` is the source eval file path relative to the repo root or run
+source root. Dashboard and other readers should display this value as `Eval`.
+They should also display `test_id` and `target` so users can distinguish rows
+with overlapping test IDs.
+
+`suite` and `name` are display metadata. They may help humans group or label
+results, but they must not drive storage, routing, Dashboard detail selection,
+rerun lookup, import identity, or artifact discovery.
+
+`index.jsonl` is authoritative for all run-relative artifact paths. Per-row
+directories are exposed with `result_dir`. Sidecar paths such as `task_dir`,
+`summary_path`, `grading_path`, `metrics_path`, `transcript_path`,
+`targets_path`, `files_path`, and `graders_path` are explicit manifest fields.
+Consumers must use these fields instead of reconstructing paths from
+`suite`, `name`, `test_id`, or `target`.
+
+`result_dir` is an opaque run-local allocation. It should stay readable when
+that does not compromise uniqueness, but implementations may add a suffix or
+allocate it differently to avoid collisions. It is not the row's public identity.
+
+## Consequences
+
+- A direct run such as `agentv eval evals/a.eval.yaml evals/b.eval.yaml`
+  produces one timestamped bundle unless the user explicitly runs separate CLI
+  commands.
+- The default no-config path is stable:
+  `.agentv/results/default/<timestamp>/`.
+- Renaming a suite or display name does not move prior results or change
+  Dashboard routing identity.
+- Multiple eval files can share the same `test_id` and suite display name as
+  long as their `eval_path` values differ.
+- Import, rerun, Dashboard, comparison, and export tools can load a run from
+  `index.jsonl` without needing source checkout conventions.
+
+## Non-Goals
+
+- Defining an `experiment.yaml` artifact.
+- Adding `eval_root`.
+- Hashing eval paths into default experiment names.
+- Creating automatic `multi-eval` experiment names.
+- Making `result_dir` a semantic folder contract.
+- Removing compatibility readers for older run bundles in this ADR.

From da48f18a579fcb7c3c0b205c993e1b787b91af5c Mon Sep 17 00:00:00 2001
From: Christopher Tso <christso@gmail.com>
Date: Sat, 27 Jun 2026 09:08:25 +0200
Subject: [PATCH 2/2] docs(results): align result directory terminology

---
 apps/web/src/content/docs/docs/evaluation/running-evals.mdx | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/apps/web/src/content/docs/docs/evaluation/running-evals.mdx b/apps/web/src/content/docs/docs/evaluation/running-evals.mdx
index 8a869a4ac..f6664b336 100644
--- a/apps/web/src/content/docs/docs/evaluation/running-evals.mdx
+++ b/apps/web/src/content/docs/docs/evaluation/running-evals.mdx
@@ -110,7 +110,7 @@ cat ./my-results/index.jsonl
 
 ### Generated Task Bundles
 
-Each result can also include a generated task bundle inside its per-test artifact
+Each result can also include a generated task bundle inside its per-test result
 directory. The bundle captures the eval slice and target settings that produced
 that row, so reviewers and rerun tooling can inspect the exact run-local source
 instead of relying on a mutable checkout.
@@ -360,7 +360,7 @@ agentv eval evals/my-eval.yaml --retry-errors .agentv/results/default/<timestamp
 
 After any failing run, the CLI prints the exact `--rerun-failed` command for the run dir that just completed — copy/paste it. If the process or pod disappeared before you could access the local run directory and results auto-push was enabled, recover the partial run from [WIP checkpoints](/docs/tools/wip-checkpoints/) first, then use the same `--resume` flow.
 
-The interactive wizard (`agentv eval` with no arguments) remembers the last run's artifact directory and surfaces a **"Resume last run"** entry in the main menu when one exists.
+The interactive wizard (`agentv eval` with no arguments) remembers the last run directory and surfaces a **"Resume last run"** entry in the main menu when one exists.
 
 ### Execution Error Tolerance