EntityProcess · christso · Jun 27, 2026
diff --git a/apps/web/src/content/docs/docs/guides/benchmark-provenance.mdx b/apps/web/src/content/docs/docs/guides/benchmark-provenance.mdx
@@ -9,7 +9,7 @@ Benchmark suites usually need more than a prompt and a score. They carry source
 pins, task patches, generated dataset rows, oracle data, setup scripts, and
 verification commands. AgentV represents that with existing primitives:
 
-- Put runtime behavior in `workspace`, `execution`, `input`, `expected_output`,
+- Put runtime behavior in `workspace`, `experiment`, `input`, `expected_output`,
   and `assertions`.
 - Put provenance and classification in per-case `metadata`.
 - Put bulky per-case authoring inputs in optional case directories and supporting files.
@@ -29,7 +29,7 @@ Use this split when deciding where a benchmark key belongs:
 | `workspace.template` | Yes | Copies a workspace template into the run workspace. |
 | `workspace.hooks` | Yes | Runs lifecycle commands with workspace and case context on stdin. |
 | `workspace.isolation`, `workspace.mode`, `workspace.path` | Yes | Controls workspace reuse and materialization. |
-| `execution` | Yes | Selects targets, thresholds, dependencies, and default grader behavior. |
+| `experiment` | Yes | Selects targets, thresholds, repeat policy, budgets, workers, and default grader behavior. |
 | `input`, `input_files`, `expected_output` | Yes | Builds the target prompt and passive reference answer. |
 | `assertions` | Yes | Runs deterministic, LLM, composite, or code graders. |
 | Top-level `name`, `version`, `tags`, `license`, `requires` | Informational | Identifies and categorizes the suite. |
@@ -162,7 +162,7 @@ workspace:
     before_each:
       command: ["python", "./scripts/apply-case-fixtures.py"]
 
-execution:
+experiment:
   targets: [codex, claude]
 
 assertions:
@@ -182,20 +182,41 @@ the imported results, and link Opik traces when Harbor uploads them.
 # Proposed runner boundary, not a current AgentV task schema.
 name: swebench-verified-codex
 
-execution:
-  runner: harbor
-  harbor:
-    dataset: swebench-verified
-    agent: codex
-    model: openai/gpt-5-mini
-    opik:
-      enabled: true
+experiment:
+  target: codex
+  model: openai/gpt-5-mini
+  runner:
+    type: harbor
+    options:
+      opik:
+        enabled: true
 ```
 
 Do not translate Harbor `task.toml`, verifier packaging, or suite-specific
 Docker/Compose adapter fields into AgentV core eval schema. If the benchmark's
 runtime contract is already owned by Harbor, keep those details in Harbor and
 let AgentV consume the job metadata, rewards, artifacts, and trace links.
+Do not add a generic top-level `source` field just to identify Harbor. If a
+future Harbor adapter needs suite selection, keep that selector narrow and
+adapter-owned instead of making it the AgentV workspace model.
+
+## Eval Composition
+
+When one eval references another eval, preserve the task/runtime split:
+
+- The parent runnable eval owns runtime `experiment:` for the run.
+- Child `experiment:` blocks are ignored by `type: suite` composition. There is
+  no fallback to the child `experiment:` when the parent has no `experiment:`.
+- Child `workspace` setup is preserved for `type: suite` imports. Parent
+  workspace applies to raw cases owned by the parent file, not to imported suite
+  tests.
+- A tests-only import can drop child workspace context only when the import mode
+  says so explicitly.
+- Workspace path collisions or incompatible isolation settings should fail
+  loudly if a future explicit remap mode is added.
+
+That rule keeps imported benchmark cases attached to their setup while still
+letting a parent eval compare targets, repeat policy, and gates consistently.
 
 ## Finance-Style Generated Dataset
 

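The "Eval Composition" rules added to the guide above can be sketched in Python. This is a minimal illustration under stated assumptions, not AgentV's implementation: `resolve_suite_import` and the plain-dict representation of an eval file are hypothetical, and the warning text is invented for the example. It shows that run policy comes only from the parent, with no fallback to the child, while the imported suite keeps its own workspace.

```python
from typing import Any


def resolve_suite_import(parent: dict[str, Any], child: dict[str, Any]) -> dict[str, Any]:
    """Resolve contracts for tests imported from `child` via `type: suite`.

    Hypothetical helper: keys mirror the YAML fields named in the guide.
    """
    warnings: list[str] = []
    # Child runtime never falls back into the parent run; surface that to the author.
    if parent.get("experiment") is None and child.get("experiment") is not None:
        warnings.append(
            "child `experiment:` is ignored; add a parent `experiment:` "
            "block or pass CLI runtime flags"
        )
    return {
        # Run policy is owned by the parent runnable eval only.
        "experiment": parent.get("experiment"),
        # Task environment stays with the imported child suite.
        "workspace": child.get("workspace"),
        "warnings": warnings,
    }


parent = {
    "experiment": {"targets": ["codex", "claude"]},
    "workspace": {"template": "./parent-template"},
}
child = {
    "experiment": {"targets": ["codex"]},
    "workspace": {"template": "./child-template"},
}
resolved = resolve_suite_import(parent, child)
```

In this sketch the parent workspace is deliberately unused for imported tests; it would apply only to raw cases owned by the parent file.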
diff --git a/docs/adr/0002-keep-harbor-benchmark-execution-behind-runner-boundary.md b/docs/adr/0002-keep-harbor-benchmark-execution-behind-runner-boundary.md
@@ -11,8 +11,9 @@ Proposed
 AgentV now has native workspace repository acquisition for custom evals, CI
 gates, target comparisons, pooled workspaces, hooks, and Docker workspace cases.
 That should remain generic infrastructure: `workspace.repos[].commit` is the
-canonical checkout pin, and `workspace.repos[].base_commit` is only a
-SWE-Bench-friendly alias for the same value.
+canonical checkout pin. SWE-Bench `base_commit` is upstream/import vocabulary
+that adapters may translate into `commit`; it should not become the canonical
+AgentV workspace field.
 
 Harbor owns benchmark-grade execution for standard suites such as SWE-Bench
 Verified, Multi-SWE-Bench, Terminal-Bench, and suites with Harbor-specific
@@ -39,101 +40,79 @@ Harbor should own:
 - Harbor `task.toml` files and Harbor YAML config;
 - Opik trace upload through Harbor when enabled.
 
-## Alignment with experiment separation
+## Alignment with inline experiment runtime
 
-The 2026-06-23 experiment/eval separation decision makes runtime binding an
-experiment concern. Harbor execution should follow the same split:
+The 2026-06-26 inline experiment decision keeps runtime binding inside the
+single runnable `eval.yaml` artifact. Harbor execution should follow the same
+split:
 
 - AgentV eval YAML remains the authoring or selection surface for what benchmark
   suite is being evaluated.
-- AgentV experiment YAML selects or pins the Harbor runner, candidate
+- The inline `experiment:` block selects or pins the Harbor runner, candidate
   agent/model, run policy, and other runtime binding.
 - Harbor-authored YAML remains Harbor's own config surface when the standard
   suite needs Harbor-specific task packaging or verifier settings.
 
 This means the examples below show the desired logical fields, but new
-runtime fields should be placed on an experiment unless they are genuinely part
-of the benchmark suite identity. Do not put candidate agent/model binding in the
-eval file for new AgentV-native examples.
+runtime fields should be placed under `experiment:` unless they are genuinely
+part of the benchmark suite identity. Do not put candidate agent/model binding
+under `source` for new AgentV-native examples.
 
 ## Minimal future config surface
 
-An AgentV eval suite can select the benchmark source without copying Harbor's
-task schema or claiming to be the runtime runner:
+An AgentV eval suite should not gain a generic top-level `source` field just to
+select Harbor. ADR 0009 keeps benchmark-shaped evals on existing primitives:
+workspace setup belongs in `workspace`, runtime binding belongs in
+`experiment`, and imported benchmark provenance belongs in metadata, sidecars,
+or adapter manifests.
 
-```yaml
-name: swebench-verified
-
-source:
-  type: harbor
-  dataset: swebench-verified
-```
-
-The corresponding experiment selects how that suite runs:
+If a future Harbor adapter needs first-class selection in AgentV YAML, it should
+be designed as a narrow runner/import selector after real usage, not as a broad
+benchmark source schema. The corresponding inline experiment block still selects
+how that suite runs:
 
 ```yaml
 name: swebench-verified-codex
-target: codex-gpt5-mini
-evals: evals/swebench-verified.eval.yaml
-runner:
-  type: harbor
-  options:
-    opik:
-      enabled: true
-```
 
-For a Harbor-authored YAML file, use `config` instead of `dataset`:
-
-```yaml
-source:
-  type: harbor
-  config: ./harbor/swebench-verified.yaml
+experiment:
+  target: codex
+  model: openai/gpt-5-mini
+  runner:
+    type: harbor
+    options:
+      opik:
+        enabled: true
 ```
 
-The first implementation should accept exactly one Harbor source selector:
-`dataset` for a known Harbor dataset id, or `config` for an existing Harbor YAML
-file. There should be no precedence rule between them. If both are set, fail
-validation and ask the user to choose one.
-
-Do not combine Harbor suite selection with candidate binding in the eval file:
+Do not combine Harbor suite selection with candidate binding in an invented
+source block:
 
 ```yaml
 # Avoid in eval.yaml
-execution:
+source:
   runner: harbor
-  harbor:
-    dataset: swebench-verified
-    agent: codex
-    model: openai/gpt-5-mini
+  dataset: swebench-verified
+  agent: codex
+  model: openai/gpt-5-mini
 ```
 
-Split that shape across the suite and experiment instead:
+Keep runtime binding in `experiment:` instead:
 
 ```yaml
-# evals/swebench-verified.eval.yaml
 name: swebench-verified
-source:
-  type: harbor
-  dataset: swebench-verified
-```
 
-```yaml
-# experiments/swebench-verified-codex.yaml
-name: swebench-verified-codex
-target: codex
-model: openai/gpt-5-mini
-evals: evals/swebench-verified.eval.yaml
-runner:
-  type: harbor
+experiment:
+  target: codex
+  model: openai/gpt-5-mini
+  runner:
+    type: harbor
 ```
 
-Keep Harbor suite source selection under `source` in the eval suite. Keep
-experiment-side runner selection under `runner.type`, with runner knobs under
-`runner.options`. The eval suite answers "where do these cases come from?"; the
-experiment answers "how is this run executed?" Do not use `execution.runner` in
-new eval-suite examples because that name collides with the experiment runner.
-Do not repeat the runner discriminator as `runner.harbor.options`; `type:
-harbor` already provides that namespace.
+Keep runtime runner selection under `experiment.runner.type`, with runner knobs
+under `experiment.runner.options`. Do not use `execution.runner` in new
+eval-suite examples because top-level `execution:` is only a legacy alias for
+old eval files. Do not repeat the runner discriminator as
+`runner.harbor.options`; `type: harbor` already provides that namespace.
 
 Do not add top-level AgentV fields for Harbor task packaging, verifier images,
 task patches, or Docker/Compose adapter settings. If a Harbor option becomes too
@@ -153,8 +132,9 @@ agentv eval evals/native.eval.yaml --target codex
 ```
 
 Harbor-backed evals should use the same top-level entrypoint. If no explicit
-experiment runner is configured, AgentV may infer Harbor execution from
-`source.type: harbor`:
+experiment runner is configured, a future adapter may infer Harbor execution
+from an adapter-owned manifest or CLI flag, but this ADR does not add that
+schema:
 
 ```bash
 agentv eval evals/swebench-harbor.eval.yaml
@@ -176,9 +156,9 @@ agentv results import harbor --job <harbor-job-id>
 ```
 
 Do not overload native `--target` semantics in the first Harbor runner slice.
-Harbor `agent`, `model`, and matrix behavior should come from the experiment or
-the referenced Harbor YAML until repeated usage proves a shared AgentV flag is
-needed.
+Harbor `agent`, `model`, and matrix behavior should come from inline
+`experiment:` runtime fields or the referenced Harbor YAML until repeated usage
+proves a shared AgentV flag is needed.
 
 ## Unsupported fields and non-goals
 
@@ -191,6 +171,8 @@ The Harbor runner mode should not add or interpret:
   `base_commit` as Harbor runner inputs;
 - generic `extra_args` or arbitrary pass-through maps in the initial AgentV
   surface.
+- generic top-level `source` as a replacement for AgentV `workspace` or
+  metadata conventions.
 
 These fields remain valid in native AgentV evals when authors compose their own
 workspace, hooks, and graders. They are non-goals only for the Harbor-backed
@@ -199,9 +181,9 @@ standard-suite path.
 ## Implementation sequencing
 
 1. Document the native-vs-Harbor boundary and commit alias rules.
-2. Add schema validation for eval-suite `source.type: harbor` and exactly one of
-   `source.dataset` or `source.config`, plus experiment `runner.type` and
-   `runner.options`, with no changes to native workspace acquisition.
+2. Add a narrow Harbor runner/import selector only after repeated usage proves
+   it is needed; keep inline `experiment.runner.type` and
+   `experiment.runner.options`, with no changes to native workspace acquisition.
 3. Add a Harbor launch adapter that records job identity and status.
 4. Add a Harbor result importer that maps rewards, exceptions, timings,
    artifacts, and Opik trace URLs into AgentV run bundles.

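The shapes that ADR 0002 asks authors to avoid can be expressed as a small lint pass. The sketch below is an assumption-laden illustration, not an AgentV API: `lint_harbor_shapes` is a hypothetical function, the eval file is assumed to be already parsed into a dict, and the messages are invented. It flags a top-level `source` block, `execution.runner`, and a repeated `runner.harbor` discriminator, and accepts the inline `experiment.runner.type` form.

```python
from typing import Any


def lint_harbor_shapes(doc: dict[str, Any]) -> list[str]:
    """Return problems for config shapes the ADR says to avoid (hypothetical)."""
    problems: list[str] = []
    # A generic top-level `source` field is not added by this ADR.
    if "source" in doc:
        problems.append("top-level `source` is not part of the eval schema")
    # Top-level `execution:` is a legacy alias; runner selection does not belong there.
    execution = doc.get("execution")
    if isinstance(execution, dict) and "runner" in execution:
        problems.append("use `experiment.runner.type` instead of `execution.runner`")
    # `type: harbor` already provides the namespace; do not nest a `harbor` key.
    experiment = doc.get("experiment")
    runner = experiment.get("runner") if isinstance(experiment, dict) else None
    if isinstance(runner, dict) and "harbor" in runner:
        problems.append("put runner knobs under `experiment.runner.options`")
    return problems


recommended = {
    "name": "swebench-verified",
    "experiment": {
        "target": "codex",
        "model": "openai/gpt-5-mini",
        "runner": {"type": "harbor"},
    },
}
avoided = {
    "source": {
        "runner": "harbor",
        "dataset": "swebench-verified",
        "agent": "codex",
        "model": "openai/gpt-5-mini",
    },
}
```

Calling `lint_harbor_shapes(recommended)` yields no problems in this sketch, while `lint_harbor_shapes(avoided)` reports the top-level `source` block.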
diff --git a/docs/adr/0006-separate-experiments-from-eval-definitions.md b/docs/adr/0006-separate-experiments-from-eval-definitions.md
@@ -28,6 +28,9 @@ The final design keeps the product boundary smaller:
 - `experiment:` is an inline run-time block inside `eval.yaml`.
 - `tests:` is the composition, import, and selection surface.
 - result bundles are written under `.agentv/results/<eval-name>/<timestamp>/`.
+- A directory named `experiments/` may be used as a user-owned repo convention
+  for wrapper eval YAML files, but it does not create a separate experiment
+  artifact type or schema-significant path.
 
 This keeps AgentV repo-native and zero-infra by default, avoids a new public
 artifact type, and still lets wrapper evals run multiple imported suites with a
@@ -45,7 +48,10 @@ Do not introduce or document:
 - committed experiment files as the canonical authoring path
 
 The only runnable authoring artifact is `eval.yaml` or another `*.eval.yaml`
-file. Runtime controls live in an inline `experiment:` block:
+file. A project may place wrapper eval files under an `experiments/` directory
+when their main job is to bind runtime policy over reusable suites, but those
+files are still ordinary eval YAML files. AgentV must not infer behavior from
+that directory name. Runtime controls live in an inline `experiment:` block:
 
 ```yaml
 name: cargowise-sql-migration-codex
@@ -103,6 +109,28 @@ The old experiment runtime fields are ported into the parent eval file:
 Suite or case workspace fields remain task-owned when they define what is being
 evaluated.
 
+## Contract Layers
+
+Parent-versus-child is not the main composition rule. Contract ownership is:
+
+| Layer | AgentV fields | Owner in `type: suite` composition |
+| --- | --- | --- |
+| Task data | `tests`, case `metadata`, `expected_output` | Imported child suite |
+| Task prompt | `input`, `input_files`, shared prompt defaults | Imported child suite |
+| Task environment | `workspace`, `workspace.repos[]`, templates, workspace hooks | Imported child suite |
+| Scoring | `assertions`, graders, expected references | Imported child suite |
+| Run policy | `experiment`, CLI target flags, workers, repeat, gates, budget | Parent wrapper eval or CLI |
+| Target runtime | selected target config and `targets[].hooks` | Selected target |
+
+`workspace` can influence what an agent perceives through tools, but it is not
+prompt input. `input` is what the agent is told; `workspace` is what the agent
+can act on; `assertions` are how AgentV judges the result; `experiment` is how
+the run is bound, repeated, compared, and gated.
+
+This framing removes the apparent "child wins except experiment" exception.
+Child suites own task contracts. Parent wrapper evals and CLI flags own run
+contracts.
+
 ## Lifecycle Ownership
 
 `experiment:` owns evaluation policy, not lifecycle mutation. Commands that
@@ -219,6 +247,25 @@ fields. Explicit override syntax can be considered later if a concrete use case
 needs it, but the default composition model must not merge task contracts in a
 surprising way.
 
+If a parent eval defines `workspace` and imports child eval suites with
+`type: suite`, the parent workspace applies only to raw cases owned by the
+parent file. Imported suite tests keep their child suite workspace. This is a
+valid mixed-case pattern when the parent owns raw cases, but it is usually a
+DX smell when every test is a `type: suite` import. AgentV should warn or lint
+that shape rather than silently implying a parent workspace override.
+
+If a parent eval has no `experiment:` and imports child suites that do have
+`experiment:` blocks, child runtime still does not fall back into the parent
+run. AgentV should warn because authors often expect the child runtime to be
+used. The correct choices are to run the child suite directly, add a parent
+`experiment:` block, or pass CLI runtime flags.
+
+Wrapper evals that import multiple suites with distinct shared workspace
+contracts should fail fast or require per-test isolation, separate runs, or an
+explicit future composition mode. Shared workspace setup is safe when one suite
+owns the task contract; it is not a place for implicit parent-child or
+child-child workspace merging.
+
 ## Runtime Overrides
 
 The parent `experiment:` block is the default runtime policy for the whole eval.
@@ -330,6 +377,13 @@ remain inspectable without reading every manifest row. Test artifacts from tests
 owned directly by the wrapper eval can still live directly under `<test-id>`.
 All cases should also retain source suite metadata in manifests and index rows.
 
+The result namespace remains `experiment` in artifacts and Dashboard. AgentV
+should not introduce a separate authored `run_group` field. For better DX,
+Dashboard and reports may derive display-only runtime source labels such as
+`inline experiment`, `CLI`, `defaults`, or `mixed`, and may show the top-level
+eval file plus imported suites. Those labels are explanations over existing
+primitives, not new configuration surface.
+
 ## Consequences
 
 Positive:
@@ -350,14 +404,22 @@ Negative:
   evals.
 - Explicit task-context override syntax is deferred, so authors who need
   overrides must create a new suite or wait for a focused override design.
+- Wrapper evals need diagnostics so authors understand that parent workspace
+  does not override imported suite workspaces and child experiment blocks are
+  ignored.
 
 ## Non-Goals
 
-- Do not add separate `experiment.yaml` files or an `experiments/` convention.
+- Do not add separate `experiment.yaml` files.
+- Do not make `experiments/` a schema-significant directory or separate
+  artifact type; it may only be a repo layout convention for ordinary wrapper
+  eval YAML files.
 - Do not add config pointers to external experiment files.
-- Do not present committed experiment files as canonical docs examples.
+- Do not present committed non-eval experiment files as canonical docs examples.
 - Do not make child suite runtime blocks participate in parent wrapper runtime
   selection.
+- Do not add an authored `run_group` field.
+- Do not implicitly merge parent and child workspaces for `type: suite` imports.
 - Do not silently override imported suite task fields from parent suite fields.
 - Do not encode source suite membership by adding redundant default result path
   segments.
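
The display-only runtime source labels described in ADR 0006 (`inline experiment`, `CLI`, `defaults`, `mixed`) could be derived roughly as follows. This is a sketch only: `runtime_source_label` and the per-field provenance map are hypothetical, and two behaviors are assumptions rather than ADR text: `mixed` is taken to mean that resolved runtime fields came from more than one source, and an empty map is labeled `defaults`.

```python
def runtime_source_label(field_sources: dict[str, str]) -> str:
    """Derive a display-only label from per-field provenance (hypothetical).

    `field_sources` maps a runtime field name to the source that supplied it:
    "inline experiment", "CLI", or "defaults".
    """
    distinct = set(field_sources.values())
    if not distinct:
        # Assumption: nothing was set explicitly, so everything is a default.
        return "defaults"
    if len(distinct) == 1:
        # Every field came from the same place; report that source directly.
        return next(iter(distinct))
    # Assumption: more than one source contributed, so the label is "mixed".
    return "mixed"


inline_only = runtime_source_label(
    {"targets": "inline experiment", "workers": "inline experiment"}
)
overridden = runtime_source_label(
    {"targets": "CLI", "workers": "inline experiment"}
)
```

Because the label is computed from values AgentV already resolves, it adds no authored configuration surface, which matches the ADR's intent.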