EntityProcess · christso · Jun 27, 2026 · Jun 27, 2026 · Jun 27, 2026
diff --git a/CONCEPTS.md b/CONCEPTS.md
@@ -16,9 +16,13 @@ Shared domain vocabulary for this project — entities, named processes, and sta
 
 **Experiment** — A committed run variant that selects how evals are executed: target or target matrix, setup, scripts, eval filters, repeat counts, timeouts, workers, budgets, and related run knobs. Experiments make A/B setup differences explicit while pointing at stable eval tasks.
 
-**Run manifest** — The root `index.jsonl` file in a run bundle. It is the dashboard and tooling loading contract for per-case result rows and artifact locations, including fields such as `artifact_dir`, `task_dir`, `summary_path`, and `grading_path`.
+**Run manifest** — The root `index.jsonl` file in a run bundle. It is the dashboard and tooling loading contract for per-case result rows and artifact locations, including fields such as `result_dir`, `task_dir`, `summary_path`, and `grading_path`.
 
-**Artifact sidecar** — A file beside or below a test-case artifact directory that provides evidence for a result, such as `summary.json`, `grading.json`, `result.json`, transcripts, logs, or outputs. Sidecars are evidence, not the primary discovery mechanism for a run.
+**Result source identity** — The stable source identity for a result row: repo-relative `eval_path`, `test_id`, and `target`. `suite` and `name` are display metadata, not storage or routing identity.
+
+**Result directory** — The `result_dir` field in an `index.jsonl` row. It is a run-local directory allocation for that row's sidecars and outputs. Consumers discover it from `index.jsonl` and must not infer it from suite names, display names, test IDs, or targets.
+
+**Artifact sidecar** — A file beside or below a result directory that provides evidence for a result, such as `summary.json`, `grading.json`, `result.json`, transcripts, logs, or outputs. Sidecars are evidence, not the primary discovery mechanism for a run.
 
 ## Evaluation Reliability
 

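Illustrative note on the CONCEPTS.md change above: a minimal sketch of how a consumer might key rows by the result source identity, assuming each `index.jsonl` row is a JSON object carrying `eval_path`, `test_id`, `target`, `suite`, and `name`. The `ResultIdentity` type, the `identity_of` helper, and the sample values are hypothetical and not taken from AgentV.

```python
import json
from typing import NamedTuple


class ResultIdentity(NamedTuple):
    """Stable source identity for a result row (field names from the docs)."""
    eval_path: str
    test_id: str
    target: str


def identity_of(row: dict) -> ResultIdentity:
    # Only these three fields form the identity; `suite` and `name` are
    # display metadata and are deliberately ignored here.
    return ResultIdentity(row["eval_path"], row["test_id"], row["target"])


# Hypothetical index.jsonl lines: same task, different display metadata.
line_a = ('{"eval_path": "evals/a.eval.yaml", "test_id": "case-1", '
          '"target": "my-target", "suite": "Old suite", "name": "Old name"}')
line_b = ('{"eval_path": "evals/a.eval.yaml", "test_id": "case-1", '
          '"target": "my-target", "suite": "Renamed suite", "name": "New name"}')
# Same test_id and suite as line_a, but a different eval file.
line_c = ('{"eval_path": "evals/b.eval.yaml", "test_id": "case-1", '
          '"target": "my-target", "suite": "Old suite", "name": "Old name"}')

id_a, id_b, id_c = (identity_of(json.loads(s)) for s in (line_a, line_b, line_c))
print(id_a == id_b)  # True: renaming suite/name does not change identity
print(id_a == id_c)  # False: a different eval_path is a different row
```

Using a tuple-like key keeps the identity hashable, so it can serve directly as a dictionary key when grouping or comparing rows across runs.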
diff --git a/apps/web/src/content/docs/docs/evaluation/running-evals.mdx b/apps/web/src/content/docs/docs/evaluation/running-evals.mdx
@@ -11,7 +11,13 @@ sidebar:
 agentv eval evals/my-eval.yaml
 ```
 
-Results are written to `.agentv/results/<experiment>/<timestamp>/index.jsonl`. When no experiment is defined, AgentV uses `.agentv/results/default/<timestamp>/index.jsonl`. Each line is a JSON object with one result per test case, and the run workspace also stores the manifest and related artifacts. Use this generated run folder as the portable audit surface: copy or sync the run directory, not a hand-authored parallel bundle.
+Results are written to `.agentv/results/<experiment>/<timestamp>/index.jsonl`.
+AgentV picks the experiment bucket from `--experiment`, then
+`eval.yaml` `experiment.name`, then `default`. Each CLI invocation writes one
+timestamped run bundle. Each line of `index.jsonl` is a JSON object holding the
+result for one test case, and the run workspace also stores the manifest and
+related artifacts. Use this generated run folder as the portable audit surface:
+copy or sync the run directory, not a hand-authored parallel bundle.
 
 Each `scores[]` entry includes per-grader timing:
 
@@ -49,14 +55,18 @@ agentv eval --target my-target evals/**/*.yaml
 
 ### Experiment Label
 
-Tag a pipeline run with an experiment name to track different conditions (e.g. with vs without skills):
+Tag a run with an experiment name to track different conditions (e.g. with vs without skills):
 
 ```bash
-agentv pipeline run evals/my-eval.yaml --experiment with_skills
-agentv pipeline run evals/my-eval.yaml --experiment without_skills
+agentv eval evals/my-eval.yaml --experiment with_skills
+agentv eval evals/my-eval.yaml --experiment without_skills
 ```
 
-The experiment label is written to `manifest.json` and propagated to each entry in `index.jsonl` by `pipeline bench`. The eval file stays the same across experiments — what changes is the environment. Dashboards can filter and compare results by experiment.
+The experiment label chooses the result bucket and is propagated to each entry
+in `index.jsonl`. CLI `--experiment` wins over `experiment.name` in the eval
+file. If neither is set, AgentV writes to the `default` bucket. The eval file
+stays the same across experiments; what changes is the runtime condition.
+Dashboards can filter and compare results by experiment.
 
 ### Run Specific Test
 
@@ -100,7 +110,7 @@ cat ./my-results/index.jsonl
 
 ### Generated Task Bundles
 
-Each result can also include a generated task bundle inside its per-test artifact
+Each result can also include a generated task bundle inside its per-test result
 directory. The bundle captures the eval slice and target settings that produced
 that row, so reviewers and rerun tooling can inspect the exact run-local source
 instead of relying on a mutable checkout.
@@ -129,14 +139,18 @@ my-results/
 ```
 
 The `index.jsonl` row links to these generated paths with snake_case fields such
-as `artifact_dir`, `task_dir`, `eval_path`, `targets_path`, `files_path`, and
+as `result_dir`, `task_dir`, `eval_path`, `targets_path`, `files_path`, and
 `graders_path`. Treat those paths as relative to the run directory. When you need
 a portable artifact for audit, review, Dashboard inspection, or rerun workflows,
 share the generated run directory and its `index.jsonl` manifest. Source-side
 case directories are still useful for organizing bulky prompts, fixtures, or
 tests while authoring an eval, but they are optional input organization rather
 than a separate artifact schema.
 
+Use repo-relative `eval_path`, `test_id`, and `target` as the source identity
+for a result row. `suite` and `name` are display metadata only; do not use them
+to infer storage paths or pick a Dashboard detail row.
+
 If the source eval uses the `PROMPT.md` fallback instead of inline `input`,
 AgentV records the generated task bundle metadata when source artifacts are
 available. It no longer emits a generated prompt sidecar for result rows.
@@ -346,7 +360,7 @@ agentv eval evals/my-eval.yaml --retry-errors .agentv/results/default/<timestamp
 
 After any failing run, the CLI prints the exact `--rerun-failed` command for the run dir that just completed — copy/paste it. If the process or pod disappeared before you could access the local run directory and results auto-push was enabled, recover the partial run from [WIP checkpoints](/docs/tools/wip-checkpoints/) first, then use the same `--resume` flow.
 
-The interactive wizard (`agentv eval` with no arguments) remembers the last run's artifact directory and surfaces a **"Resume last run"** entry in the main menu when one exists.
+The interactive wizard (`agentv eval` with no arguments) remembers the last run directory and surfaces a **"Resume last run"** entry in the main menu when one exists.
 
 ### Execution Error Tolerance
 
@@ -443,7 +457,7 @@ See the [Import tool docs](/docs/tools/import/) for all providers and options.
 
 ## Transcript And Result Artifacts
 
-Each result row's `artifact_dir` is a case-local folder under the timestamped
+Each result row's `result_dir` is a case-local folder under the timestamped
 run bundle. It can include `transcript.jsonl`, `transcript-raw.jsonl`,
 `grading.json`, `timing.json`, `metrics.json`, and generated outputs under
 `outputs/`. The run root does not contain a mixed transcript artifact; use each

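Illustrative note on the running-evals.mdx change above: the bucket precedence (`--experiment`, then `eval.yaml` `experiment.name`, then `default`) could be sketched as follows. This is not AgentV's implementation. The function names are hypothetical, the eval file is assumed to be already parsed into a dict, and treating an empty string as "unset" is an assumption of this sketch.

```python
from pathlib import PurePosixPath
from typing import Optional


def resolve_experiment(cli_experiment: Optional[str], eval_config: dict) -> str:
    """Pick the bucket: CLI value, then experiment.name, then 'default'."""
    if cli_experiment:  # assumption: an empty string counts as unset
        return cli_experiment
    name = (eval_config.get("experiment") or {}).get("name")
    return name or "default"


def run_bundle_dir(experiment: str, timestamp: str) -> PurePosixPath:
    # Layout from the docs: .agentv/results/<experiment>/<timestamp>/
    return PurePosixPath(".agentv/results") / experiment / timestamp


eval_config = {"experiment": {"name": "with_skills"}}

print(resolve_experiment("without_skills", eval_config))  # without_skills
print(resolve_experiment(None, eval_config))              # with_skills
print(resolve_experiment(None, {}))                       # default
# The timestamp format is not specified in the docs, so a placeholder is used.
print(run_bundle_dir(resolve_experiment(None, {}), "<timestamp>"))
```

Keeping the fallback chain in one function makes it easy to confirm that nothing else (filenames, suite names, number of eval files) feeds into the bucket name.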
diff --git a/docs/adr/0006-separate-experiments-from-eval-definitions.md b/docs/adr/0006-separate-experiments-from-eval-definitions.md
@@ -9,6 +9,10 @@ Accepted
 Supersedes: the 2026-06-23 proposal in this file to separate experiment files
 from eval definitions.
 
+Partially superseded by
+[ADR 0009](0009-eval-path-result-identity-and-default-experiment.md) for result
+experiment bucket precedence, result row identity, and run bundle path naming.
+
 ## Context
 
 AgentV needs a stable authoring contract for repo-native evals, run-time knobs,

diff --git a/docs/adr/0009-eval-path-result-identity-and-default-experiment.md b/docs/adr/0009-eval-path-result-identity-and-default-experiment.md
@@ -0,0 +1,102 @@
+# 9. Use eval_path identity and the default result experiment
+
+Date: 2026-06-27
+
+## Status
+
+Accepted
+
+Supersedes: result naming and storage-routing portions of
+[ADR 0006](0006-separate-experiments-from-eval-definitions.md) that derive run
+bundle names or per-case artifact paths from eval names, suite names, or wrapper
+composition.
+
+## Context
+
+AgentV needs one simple result identity contract that works for direct eval
+runs, imported evals, repeated attempts, Dashboard inspection, and downstream
+tools that consume portable run bundles.
+
+The earlier direction from the same week kept `eval.yaml` as the authored
+experiment spec, but it still let result buckets and per-case paths be inferred
+from eval or suite names. That makes routing unstable when a single CLI
+invocation runs multiple eval files, imports suites with overlapping case IDs,
+or changes display metadata without changing the task under evaluation.
+
+The final contract keeps authoring and storage separate:
+
+- `eval.yaml` remains the authored experiment spec;
+- a CLI invocation produces one timestamped run bundle;
+- per-row source identity is stored in `index.jsonl`;
+- `suite` and `name` remain display metadata only;
+- path discovery comes from the run manifest, not from folder conventions.
+
+## Decision
+
+One AgentV CLI invocation writes one run bundle under:
+
+```text
+.agentv/results/<experiment>/<timestamp>/
+```
+
+The result experiment bucket is selected in this order:
+
+1. the explicit CLI `--experiment` value;
+2. `eval.yaml` `experiment.name`;
+3. `default`.
+
+`default` is the canonical bucket when neither the CLI nor the eval file names
+an experiment. AgentV does not derive default experiment names from filenames,
+suite names, the number of input eval files, or multi-eval wrapper shapes.
+
+`eval.yaml` stays the authored experiment spec. Do not introduce
+`experiment.yaml`, `experiments/default.yaml`, or `eval_root` for this pass.
+
+Each row in `index.jsonl` is identified by:
+
+```text
+eval_path + test_id + target
+```
+
+`eval_path` is the source eval file path relative to the repo root or run
+source root. Dashboard and other readers should display this value as `Eval`.
+They should also display `test_id` and `target` so users can distinguish rows
+with overlapping test IDs.
+
+`suite` and `name` are display metadata. They may help humans group or label
+results, but they must not drive storage, routing, Dashboard detail selection,
+rerun lookup, import identity, or artifact discovery.
+
+`index.jsonl` is authoritative for all run-relative artifact paths. Per-row
+directories are exposed with `result_dir`. Sidecar paths such as `task_dir`,
+`summary_path`, `grading_path`, `metrics_path`, `transcript_path`,
+`targets_path`, `files_path`, and `graders_path` are explicit manifest fields.
+Consumers must use these fields instead of reconstructing paths from
+`suite`, `name`, `test_id`, or `target`.
+
+`result_dir` is an opaque run-local allocation. It should stay readable when
+that does not compromise uniqueness, but implementations may add a suffix or
+allocate differently to avoid collisions. It is not the row's public identity.
+
+## Consequences
+
+- A direct run such as `agentv eval evals/a.eval.yaml evals/b.eval.yaml`
+  produces one timestamped bundle unless the user explicitly runs separate CLI
+  commands.
+- The default no-config path is stable:
+  `.agentv/results/default/<timestamp>/`.
+- Renaming a suite or display name does not move prior results or change
+  Dashboard routing identity.
+- Multiple eval files can share the same `test_id` and suite display name as
+  long as their `eval_path` values differ.
+- Import, rerun, Dashboard, comparison, and export tools can load a run from
+  `index.jsonl` without needing source checkout conventions.
+
+## Non-Goals
+
+- Defining an `experiment.yaml` artifact.
+- Adding `eval_root`.
+- Hashing eval paths into default experiment names.
+- Creating automatic `multi-eval` experiment names.
+- Making `result_dir` a semantic folder contract.
+- Removing compatibility readers for older run bundles in this ADR.
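
Illustrative note on ADR 0009 above: a minimal sketch of manifest-driven path discovery, assuming `index.jsonl` holds one JSON object per line and that path fields are relative to the run directory, as the docs state. The helpers `load_rows` and `resolve`, and the sample `result_dir` value `r-0001`, are hypothetical. The point is that the reader never rebuilds a path from `suite`, `name`, `test_id`, or `target`.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def load_rows(run_dir: Path) -> list:
    """Read every non-empty line of the run manifest as a JSON object."""
    rows = []
    with (run_dir / "index.jsonl").open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                rows.append(json.loads(line))
    return rows


def resolve(run_dir: Path, row: dict, field: str) -> Optional[Path]:
    """Resolve a manifest path field against the run directory.

    Returns None when the row does not carry the field, instead of guessing
    a location from display metadata.
    """
    value = row.get(field)
    return run_dir / value if value else None


with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    row = {
        "eval_path": "evals/a.eval.yaml",
        "test_id": "case-1",
        "target": "my-target",
        "result_dir": "r-0001",  # opaque allocation, treated as a plain string
        "grading_path": "r-0001/grading.json",
    }
    (run_dir / "index.jsonl").write_text(json.dumps(row) + "\n", encoding="utf-8")

    loaded = load_rows(run_dir)
    grading = resolve(run_dir, loaded[0], "grading_path")
    missing = resolve(run_dir, loaded[0], "transcript_path")
    print(grading.relative_to(run_dir))
    print(missing)
```

Returning `None` for an absent field keeps the "no inference" rule visible at the call site: a caller has to handle the missing sidecar rather than fall back to a folder convention.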