fix(eval): tighten workspace composition contracts #1547
Merged
Deploying agentv with
| Latest commit: | 352dd42 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://9d645225.agentv.pages.dev |
| Branch Preview URL: | https://implement-composition-dx-bea.agentv.pages.dev |
Make per_case the only per-case workspace isolation spelling across the public schema, SDK/core types, parser, validator, docs, and generated eval schema. Invalid legacy per_test values now fail instead of silently becoming shared behavior, and static workspace conflict detection checks per-case case workspaces directly.
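As a rough illustration of the "fail instead of silently becoming shared" behavior, the sketch below parses an isolation value strictly. The function and type names are hypothetical and are not taken from the agentv codebase; only the values `shared`, `per_case`, and `per_test` come from this PR.

```typescript
// Hypothetical sketch: parseIsolation and WorkspaceIsolation are illustrative
// names, not agentv's actual API.
type WorkspaceIsolation = "shared" | "per_case";

const ACCEPTED_ISOLATION: readonly string[] = ["shared", "per_case"];

function parseIsolation(value: unknown): WorkspaceIsolation {
  if (typeof value === "string" && ACCEPTED_ISOLATION.includes(value)) {
    return value as WorkspaceIsolation;
  }
  // Legacy spelling gets a targeted hint instead of a silent fallback to "shared".
  const hint = value === "per_test" ? ' (legacy "per_test" was removed; use "per_case")' : "";
  throw new Error(`Invalid workspace.isolation: ${JSON.stringify(value)}${hint}`);
}
```

Throwing at parse time, rather than defaulting, is what keeps a stale `per_test` config from quietly running with shared state.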
Keep experiment.workspace aligned with the runtime knobs the runner actually consumes by limiting it to mode/path, and validate that contract consistently for string-shorthand test imports.
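A minimal sketch of that contract, assuming the check is a simple allow-list over the keys of `experiment.workspace`. The helper name and the error wording are hypothetical; only the `mode` and `path` keys come from this PR, and the `mode` value used below is a placeholder.

```typescript
// Hypothetical sketch: validateExperimentWorkspace is an illustrative name,
// not agentv's actual validator.
const RUNTIME_WORKSPACE_KEYS = new Set(["mode", "path"]);

function validateExperimentWorkspace(config: Record<string, unknown>): string[] {
  // Report every key that is not a runtime override.
  return Object.keys(config)
    .filter((key) => !RUNTIME_WORKSPACE_KEYS.has(key))
    .map(
      (key) =>
        `experiment.workspace.${key} is not a runtime override; ` +
        "move it to top-level or case-level workspace",
    );
}

// Usage: `repos` and `isolation` are rejected, `path` is accepted.
const problems = validateExperimentWorkspace({
  path: "./ws",
  repos: [],
  isolation: "per_case",
});
console.log(problems.length);
```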
Summary
Eval composition now has a strict workspace ownership rule: wrapper evals can import child suites and set runtime policy, but they cannot define parent workspace-affecting fields when any `type: suite` import is present. That blocks top-level `workspace`, `experiment.workspace`, and legacy `execution.workspace` from silently overriding a child suite's task environment.

Shared workspace setup is owner-scoped. Raw parent cases or workspace-less imported cases no longer inherit a child suite's shared workspace path or `.code-workspace` file just because that suite needed setup.

Setup-only workspace fields now trigger shared setup too. Env preflight checks and Docker setup run even when a case defines only `workspace.env` or only `workspace.docker` without repos, hooks, templates, or a path.

Provider batch dispatch is disabled for workspace-enabled evals, so workspace setup, per-case workspace files, and scoped dispatch cannot be bypassed by a batch-capable provider path.
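The ownership, setup-trigger, and batching rules described here can be sketched as three small predicates. This is an illustration under stated assumptions, not the PR's implementation: the `EvalFile` and `WorkspaceConfig` shapes, the `imports` array, and all function names are hypothetical, and "workspace-enabled" is assumed to mean that at least one case carries a `workspace` block.

```typescript
// Hypothetical shapes: field names echo the PR text, the types are illustrative.
interface WorkspaceConfig {
  path?: string;
  repos?: unknown[];
  hooks?: unknown;
  template?: string;
  env?: Record<string, unknown>;
  docker?: Record<string, unknown>;
}

interface EvalFile {
  imports?: { type: "suite" | "tests"; path: string }[];
  workspace?: WorkspaceConfig;
  experiment?: { workspace?: unknown };
  execution?: { workspace?: unknown };
}

// Ownership: a wrapper that imports any suite may not define parent workspace fields.
function findOwnershipViolations(file: EvalFile): string[] {
  const importsSuite = (file.imports ?? []).some((entry) => entry.type === "suite");
  if (!importsSuite) return [];
  const violations: string[] = [];
  if (file.workspace !== undefined) violations.push("workspace");
  if (file.experiment?.workspace !== undefined) violations.push("experiment.workspace");
  if (file.execution?.workspace !== undefined) violations.push("execution.workspace");
  return violations;
}

// Setup trigger: env-only and docker-only configs count, not just repos/hooks/templates/path.
function needsSharedSetup(workspace: WorkspaceConfig | undefined): boolean {
  if (workspace === undefined) return false;
  return (
    workspace.path !== undefined ||
    workspace.repos !== undefined ||
    workspace.hooks !== undefined ||
    workspace.template !== undefined ||
    workspace.env !== undefined ||
    workspace.docker !== undefined
  );
}

// Batching: assumed to be disabled as soon as any case has a workspace block.
function canUseProviderBatching(
  providerSupportsBatch: boolean,
  cases: { workspace?: WorkspaceConfig }[],
): boolean {
  return providerSupportsBatch && !cases.some((c) => c.workspace !== undefined);
}
```

Returning a list of offending field names from the ownership check, rather than a boolean, lets a validator report every conflict in one pass.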
The workspace isolation vocabulary is tightened across schema, parser, validator, and experiment normalization:
- `per_case` is the only accepted per-case isolation value. Legacy `per_test` now fails instead of degrading into unclear behavior.
- `experiment.workspace` is now aligned with what the runner actually consumes: it accepts only runtime `mode` and `path` overrides. Repos, hooks, templates, Docker config, and isolation stay in top-level or case-level `workspace`.

Design Notes
- `type: suite` preserves child suite context, including child `workspace`.
- `type: tests` and raw parent-owned cases may still use parent workspace.
- `experiment.workspace` is limited to runtime `mode` and `path`.
- `env` and `docker` configs count as setup-affecting fields even when no repo/template/path is present.
- `.code-workspace` files are passed only to cases that receive that suite's workspace.
- `workspace.isolation` accepts `shared` or `per_case`; `per_test` is removed.
- … `run_group` or a separate experiment file.
- … `av-0ys`.

Verification
- `bun --filter @agentv/core test`: 2053 pass, 0 fail
- `bun run typecheck`: pass
- `bun run lint`: pass
- `git diff --check`: pass
- `bun test packages/core/test/evaluation/workspace/setup.test.ts packages/core/test/evaluation/workspace/docker-workspace.test.ts -t "runs shared setup for env-only workspace configs|runs shared setup for docker-only workspace configs|DockerWorkspaceProvider"`: 35 pass
- `bun test packages/core/test/evaluation/workspace/setup.test.ts packages/core/test/evaluation/orchestrator.test.ts`: 103 pass
- `bun test packages/core/test/evaluation/workspace/setup.test.ts -t "does not apply a child suite shared workspace to raw cases with no workspace"`: 1 pass
- `bun test packages/core/test/evaluation/orchestrator.test.ts -t "does not pass suite workspaceFile to a case without the shared workspace|disables provider batching when cases require workspace setup"`: 2 pass
- `bun test packages/core/test/evaluation/validation/eval-validator.test.ts -t "experiment workspace|tests string path|workspace repo validation|parent workspace|legacy execution workspace|task workspace fields|runtime workspace overrides"`: 18 pass
- `bun test packages/core/test/evaluation/repo-schema-validation.test.ts packages/core/test/evaluation/experiment.test.ts packages/core/test/evaluation/validation/eval-file-schema.test.ts packages/core/test/evaluation/validation/eval-schema-sync.test.ts -t "schema|per_test|per_case|experiment workspace|runtime workspace|task workspace|normalize|sync"`: 21 pass
- `bun test packages/core/test/evaluation/validation/eval-file-schema.test.ts packages/core/test/evaluation/validation/eval-schema-sync.test.ts packages/core/test/evaluation/repo-schema-validation.test.ts packages/core/test/evaluation/experiment.test.ts packages/core/test/evaluation/workspace/setup.test.ts packages/core/test/evaluation/orchestrator.test.ts packages/core/test/evaluation/eval-inline-experiment.test.ts -t "schema|per_test|per_case|experiment workspace|raw cases|shared workspace|--workspace flag|workspace isolation|parent workspace|legacy execution workspace|tests string path|runtime workspace|task workspace|type: suite"`: 43 pass

Tracker
Implementation Beads are intentionally left `in_progress` until this PR lands: `av-ha5`, `av-pkp`, `av-58q`, `av-82t`, and `av-ldm`. Follow-ups: `av-dxp` covers the architecture-regression eval, and `av-0ys` covers possible pooling authoring-model simplification.