fix(eval): tighten workspace composition contracts#1547

Merged: christso merged 13 commits into main from implement-composition-dx-beads on Jun 28, 2026

@christso christso commented Jun 27, 2026

Collaborator

Summary

Eval composition now has a strict workspace ownership rule: wrapper evals can import child suites and set runtime policy, but they cannot define parent workspace-affecting fields when any type: suite import is present. That blocks top-level workspace, experiment.workspace, and legacy execution.workspace from silently overriding a child suite's task environment.
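The ownership rule above can be sketched as a small validation pass. This is an illustration only: the `EvalFile` shape, the `imports` field name, and `findWrapperWorkspaceConflicts` are hypothetical and are not taken from the AgentV codebase.

```typescript
// Hypothetical shapes for illustration; the real eval schema may differ.
type ImportEntry = { type: "suite" | "tests"; path: string };

interface EvalFile {
  workspace?: unknown;
  experiment?: { workspace?: unknown };
  execution?: { workspace?: unknown }; // legacy location
  imports?: ImportEntry[]; // assumed field name for composed imports
}

// Returns the parent workspace-affecting fields that conflict with a
// `type: suite` import. An empty array means the wrapper is acceptable.
function findWrapperWorkspaceConflicts(evalFile: EvalFile): string[] {
  const importsSuite = (evalFile.imports ?? []).some(
    (entry) => entry.type === "suite",
  );
  if (!importsSuite) return [];

  const conflicts: string[] = [];
  if (evalFile.workspace !== undefined) conflicts.push("workspace");
  if (evalFile.experiment?.workspace !== undefined) {
    conflicts.push("experiment.workspace");
  }
  if (evalFile.execution?.workspace !== undefined) {
    conflicts.push("execution.workspace");
  }
  return conflicts;
}
```

In this sketch, a wrapper that only uses `type: tests` imports returns no conflicts, which matches the design note that tests and raw parent-owned cases may still use the parent workspace.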

Shared workspace setup is owner-scoped. Raw parent cases or workspace-less imported cases no longer inherit a child suite's shared workspace path or .code-workspace file just because that suite needed setup.
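A minimal sketch of owner scoping, assuming each shared workspace records the suite that required it. The `SharedWorkspace` and `ScopedCase` types and the `sharedWorkspaceFor` helper are hypothetical names chosen for this example.

```typescript
// Hypothetical types; not the actual AgentV data model.
interface SharedWorkspace {
  ownerSuite: string; // suite whose setup produced this workspace
  path: string;
  workspaceFile?: string; // e.g. a .code-workspace file
}

interface ScopedCase {
  id: string;
  suite?: string; // undefined for raw parent-owned cases
  hasWorkspace: boolean; // false for workspace-less cases
}

// A case receives a shared workspace only when it declares a workspace
// and belongs to the suite that owns that shared setup.
function sharedWorkspaceFor(
  evalCase: ScopedCase,
  shared: SharedWorkspace[],
): SharedWorkspace | undefined {
  if (!evalCase.hasWorkspace || evalCase.suite === undefined) return undefined;
  return shared.find((ws) => ws.ownerSuite === evalCase.suite);
}
```

Under these assumptions, a raw parent case or a workspace-less imported case resolves to `undefined`, so it gets neither the shared path nor the suite's .code-workspace file.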

Setup-only workspace fields now trigger shared setup too. Env preflight checks and Docker setup run even when a case defines only workspace.env or only workspace.docker without repos, hooks, templates, or a path.
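The selection rule can be illustrated with a predicate that treats env and Docker config as setup-affecting. The `WorkspaceConfig` field names below mirror the terms used in this description, but the exact shape and the `requiresSharedSetup` helper are assumptions.

```typescript
// Hypothetical workspace config shape for illustration.
interface WorkspaceConfig {
  path?: string;
  repos?: unknown[];
  hooks?: unknown;
  template?: string;
  env?: unknown; // env preflight requirements
  docker?: unknown; // Docker setup config
}

// True when any setup-affecting field is present, including env-only
// and docker-only configs.
function requiresSharedSetup(ws: WorkspaceConfig | undefined): boolean {
  if (ws === undefined) return false;
  return (
    ws.path !== undefined ||
    (ws.repos?.length ?? 0) > 0 ||
    ws.hooks !== undefined ||
    ws.template !== undefined ||
    ws.env !== undefined ||
    ws.docker !== undefined
  );
}
```

Before this change, a check of this kind would presumably have looked only at repos, hooks, templates, and path, which is how env-only and docker-only cases could skip setup.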

Provider batch dispatch is disabled for workspace-enabled evals, so workspace setup, per-case workspace files, and scoped dispatch cannot be bypassed by a batch-capable provider path.
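As a rough sketch of the fallback, dispatch mode can be decided from provider capability and whether any case needs workspace setup. `DispatchCase`, `DispatchMode`, and `chooseDispatchMode` are hypothetical names, not the orchestrator's real API.

```typescript
// Hypothetical types for illustration only.
interface DispatchCase {
  id: string;
  requiresWorkspaceSetup: boolean;
}

type DispatchMode = "batch" | "individual";

// Batch dispatch is used only when the provider supports it and no case
// requires workspace setup; otherwise every case is dispatched on its own.
function chooseDispatchMode(
  providerSupportsBatch: boolean,
  cases: DispatchCase[],
): DispatchMode {
  const anyWorkspace = cases.some((c) => c.requiresWorkspaceSetup);
  return providerSupportsBatch && !anyWorkspace ? "batch" : "individual";
}
```

A single workspace-enabled case is enough to force individual dispatch for the whole run in this sketch, which is the conservative reading of "workspace-enabled evals".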

The workspace isolation vocabulary is tightened across schema, parser, validator, and experiment normalization: per_case is the only accepted per-case isolation value. Legacy per_test now fails validation instead of silently degrading to shared behavior.
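A strict parser for the isolation value might look like the following. The `parseIsolation` helper and its error message are illustrative assumptions; only the accepted values `shared` and `per_case` come from this PR. Default handling for an omitted value is not modeled here.

```typescript
type WorkspaceIsolation = "shared" | "per_case";

// Accepts only the two supported spellings and throws on anything else,
// including the legacy "per_test" value.
function parseIsolation(value: unknown): WorkspaceIsolation {
  if (value === "shared" || value === "per_case") return value;
  throw new Error(
    `Invalid workspace.isolation ${JSON.stringify(value)}: expected "shared" or "per_case"`,
  );
}
```

Failing loudly here avoids the earlier situation where an unrecognized spelling could be treated as shared isolation without any warning.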

experiment.workspace is now aligned with what the runner actually consumes: it accepts only runtime mode and path overrides. Repos, hooks, templates, Docker config, and isolation stay in top-level or case-level workspace.
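The narrowed contract can be sketched as an allow-list check over the keys of experiment.workspace. The helper name is hypothetical, and the example values passed for `mode` and `path` are placeholders rather than documented options.

```typescript
// Only runtime overrides are allowed under experiment.workspace.
const ALLOWED_EXPERIMENT_WORKSPACE_KEYS = new Set(["mode", "path"]);

// Returns the keys that belong in top-level or case-level workspace
// instead of experiment.workspace.
function findUnsupportedExperimentWorkspaceKeys(
  workspace: Record<string, unknown>,
): string[] {
  return Object.keys(workspace).filter(
    (key) => !ALLOWED_EXPERIMENT_WORKSPACE_KEYS.has(key),
  );
}

// Placeholder values; "example-mode" is not a real mode name.
const unsupported = findUnsupportedExperimentWorkspaceKeys({
  mode: "example-mode",
  path: "/tmp/ws",
  repos: [],
  isolation: "per_case",
});
console.log(unsupported); // [ 'repos', 'isolation' ]
```

A validator built on a check like this could report each unsupported key with a hint to move it to the task-level workspace.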

Design Notes

| Area | Decision |
| --- | --- |
| Suite imports | type: suite preserves child suite context, including child workspace. |
| Wrapper workspace | Parent workspace-affecting fields are rejected when a wrapper imports eval suites. |
| Tests/raw imports | type: tests and raw parent-owned cases may still use parent workspace. |
| Experiment policy | Parent wrapper evals own composed experiment/runtime policy; child experiments are ignored. |
| Experiment workspace | experiment.workspace is limited to runtime mode and path. |
| Shared setup selection | Workspace env and docker configs count as setup-affecting fields even when no repo/template/path is present. |
| Workspace file scoping | Suite .code-workspace files are passed only to cases that receive that suite's workspace. |
| Provider batching | Batch mode falls back to per-case dispatch when any workspace setup is required. |
| Isolation spelling | workspace.isolation accepts shared or per_case; per_test is removed. |
| Runtime labels | Run artifacts, reports, and Dashboard derive runtime source without adding run_group or a separate experiment file. |
| Pooling | Pooling remains a runtime/materialization strategy for now; simplification is tracked separately in av-0ys. |

Verification

  • bun --filter @agentv/core test — 2053 pass, 0 fail
  • bun run typecheck — pass
  • bun run lint — pass
  • git diff --check — pass
  • bun test packages/core/test/evaluation/workspace/setup.test.ts packages/core/test/evaluation/workspace/docker-workspace.test.ts -t "runs shared setup for env-only workspace configs|runs shared setup for docker-only workspace configs|DockerWorkspaceProvider" — 35 pass
  • bun test packages/core/test/evaluation/workspace/setup.test.ts packages/core/test/evaluation/orchestrator.test.ts — 103 pass
  • bun test packages/core/test/evaluation/workspace/setup.test.ts -t "does not apply a child suite shared workspace to raw cases with no workspace" — 1 pass
  • bun test packages/core/test/evaluation/orchestrator.test.ts -t "does not pass suite workspaceFile to a case without the shared workspace|disables provider batching when cases require workspace setup" — 2 pass
  • bun test packages/core/test/evaluation/validation/eval-validator.test.ts -t "experiment workspace|tests string path|workspace repo validation|parent workspace|legacy execution workspace|task workspace fields|runtime workspace overrides" — 18 pass
  • bun test packages/core/test/evaluation/repo-schema-validation.test.ts packages/core/test/evaluation/experiment.test.ts packages/core/test/evaluation/validation/eval-file-schema.test.ts packages/core/test/evaluation/validation/eval-schema-sync.test.ts -t "schema|per_test|per_case|experiment workspace|runtime workspace|task workspace|normalize|sync" — 21 pass
  • bun test packages/core/test/evaluation/validation/eval-file-schema.test.ts packages/core/test/evaluation/validation/eval-schema-sync.test.ts packages/core/test/evaluation/repo-schema-validation.test.ts packages/core/test/evaluation/experiment.test.ts packages/core/test/evaluation/workspace/setup.test.ts packages/core/test/evaluation/orchestrator.test.ts packages/core/test/evaluation/eval-inline-experiment.test.ts -t "schema|per_test|per_case|experiment workspace|raw cases|shared workspace|--workspace flag|workspace isolation|parent workspace|legacy execution workspace|tests string path|runtime workspace|task workspace|type: suite" — 43 pass

Tracker

Implementation Beads are intentionally left in_progress until this PR lands: av-ha5, av-pkp, av-58q, av-82t, and av-ldm. Follow-ups: av-dxp covers the architecture-regression eval, and av-0ys covers possible pooling authoring-model simplification.


Compound Engineering
Codex

cloudflare-workers-and-pages Bot commented Jun 27, 2026


Deploying agentv with Cloudflare Pages

Latest commit: 352dd42
Status: ✅  Deploy successful!
Preview URL: https://9d645225.agentv.pages.dev
Branch Preview URL: https://implement-composition-dx-bea.agentv.pages.dev


Make per_case the only per-case workspace isolation spelling across the public schema, SDK/core types, parser, validator, docs, and generated eval schema. Invalid legacy per_test values now fail instead of silently becoming shared behavior, and static workspace conflict detection checks per-case case workspaces directly.
@christso christso changed the title from "fix(eval): forbid ambiguous wrapper workspaces" to "fix(eval): tighten workspace composition contracts" Jun 27, 2026
christso added 5 commits June 28, 2026 00:11
Keep experiment.workspace aligned with the runtime knobs the runner actually consumes by limiting it to mode/path, and validate that contract consistently for string-shorthand test imports.
@christso christso marked this pull request as ready for review June 28, 2026 01:06
@christso christso merged commit 9022d87 into main Jun 28, 2026
8 checks passed
@christso christso deleted the implement-composition-dx-beads branch June 28, 2026 01:06