feat(evals): Layer 3 build runner + PR-gate CI workflow#570
Conversation
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Autofix Details
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Live run deletes working dir before scorers read it
- Removed the premature rmSync() call in runLive() that deleted workingDir before scorers could read files from it, allowing all file-content-based scorers to function correctly during live eval runs.
Or push these changes by commenting:
@cursor push cb9a21180b
Preview (cb9a21180b)
diff --git a/evals/runner/invoke-wizard.ts b/evals/runner/invoke-wizard.ts
--- a/evals/runner/invoke-wizard.ts
+++ b/evals/runner/invoke-wizard.ts
@@ -246,14 +246,11 @@
const parsed = parseStream(stdout);
const fsSnapshot = captureFsSnapshot(pristineDir, workingDir);
- // Best-effort teardown. If this fails the next run will still
- // start by removing `workingDir` (mintWorkingDir is keyed on runId,
- // which is fresh per call, so a leak only costs tmpdir space).
- try {
- rmSync(workingDir, { recursive: true });
- } catch {
- // ignore — diagnostics will surface a stale dir
- }
+ // Note: workingDir is NOT deleted here because scorers need to read
+ // files from it. The directory is under os.tmpdir() and keyed by
+ // runId (fresh per call), so multiple runs won't collide. The caller
+ // (run-eval.ts) can clean it up after scoring, or let the OS
+ // eventually reclaim tmpdir space.
return {
artifact: {You can send follow-ups to the cloud agent here.
c360d99 to
411d2d3
Compare
|
Rebased onto main (force-push). The branch had drifted significantly behind — main got #601's call-site eval framework which conflicted on Approach: cherry-picked onto pr-567's already-rebased tip + my one new commit, since #570's older commits are already on main via prior merges (cherry-picking them would have produced empty patches). Resolved one conflict on Stack now (6 commits ahead of main, all yours + kelsonpw + me):
Verification: Note: this PR now contains the same first 5 commits as #567. Once #567 merges, those will collapse out automatically on the next rebase. |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Workflow path filter references wrong filename for self-trigger
- Changed the path filter from 'evals-pr.yml' (the separate call-site gate workflow) to 'evals-pr-scenarios.yml' (the correct self-reference) so changes to this workflow file trigger CI validation.
Or push these changes by commenting:
@cursor push 584917877b
Preview (584917877b)
diff --git a/.github/workflows/evals-pr-scenarios.yml b/.github/workflows/evals-pr-scenarios.yml
--- a/.github/workflows/evals-pr-scenarios.yml
+++ b/.github/workflows/evals-pr-scenarios.yml
@@ -20,7 +20,7 @@
- 'skills/integration/**'
- 'evals/**'
- 'package.json'
- - '.github/workflows/evals-pr.yml'
+ - '.github/workflows/evals-pr-scenarios.yml'
# Run-once-per-ref semantics: a new push to the PR cancels the prior
# run. PR-gate evals are deterministic against golden replays today,You can send follow-ups to the cloud agent here.
Reviewed by Cursor Bugbot for commit 411d2d3. Configure here.
| - 'skills/integration/**' | ||
| - 'evals/**' | ||
| - 'package.json' | ||
| - '.github/workflows/evals-pr.yml' |
There was a problem hiding this comment.
Workflow path filter references wrong filename for self-trigger
Low Severity
The workflow file evals-pr-scenarios.yml includes '.github/workflows/evals-pr.yml' in its path triggers, but that's a separate pre-existing workflow (the call-site gate). The intended self-reference to enable CI validation of changes to this workflow should be '.github/workflows/evals-pr-scenarios.yml'. The PR description even refers to the new file as evals-pr.yml, confirming the developer thought that was the filename — the rename to avoid collision with the existing file wasn't reflected in this path entry.
Reviewed by Cursor Bugbot for commit 411d2d3. Configure here.
411d2d3 to
d06d185
Compare
Vanilla React + Vite SPA — the "plain SPA" cohort with no
meta-framework router. Tests VITE_ env prefix and the standard
src/main.tsx entry. Distinct from RR7-data: no router config, init
lands directly in main.tsx alongside createRoot.
Pristine: minimal create-vite-style React+TS starter (package.json,
tsconfig, vite.config, index.html, src/main.tsx, src/App.tsx).
Golden adds @amplitude/unified, .env.local with VITE_AMPLITUDE_API_KEY,
init() in main.tsx, and a track('Sign Up Submitted') call in App.tsx.
Verified: pnpm evals:run react-vite/vanilla → 50/50, contractOk,
hardFailed=false. All Layer 0 + Layer 1 scorers pass.
Stacked on PR #570 (ba-104-layer3-and-ci); will rebase down the
chain once #560 → #567 → #570 land.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 5 of Week 2 — proof that the eval suite actually catches the
regressions it was built to catch. Each test forks the existing
nextjs-app-router/vanilla golden into a tmpdir, applies a deliberate
regression, and asserts the full scorer stack reports the failure
shape we expect.
Three controls land here:
1. Wrong SDK family — substitute @amplitude/analytics-browser for
@amplitude/unified in the golden's package.json. L0-correct-sdk-package
hard-fails with a detail naming the wrong family.
2. Hardcoded API key — plant the eval-only API key literal in the
entry file. L0-no-hardcoded-key hard-fails.
3. Confirmed event without a track() call — strip a track('Sign In')
while keeping it in the proposed plan. L1-confirmed-events-tracked
fails with a detail naming 'Sign In'.
These run cheaply (sub-second per test) so they live alongside the
success-path runner.test.ts. If a future scorer change makes a
negative control unexpectedly pass, the suite caught a regression in
itself — that's the point.
Verified: 34/34 tests across 6 files; lint clean.
Stacked on PR #570 (ba-104-layer3-and-ci); will rebase down the
chain.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kelsonpw
left a comment
There was a problem hiding this comment.
LGTM. Layer 3 build runner is well scoped — pnpm install + scenario buildCommand, redacted stderr tail, separate exitCodes for install vs build (criterion detail distinguishes lockfile drift from a wizard regression). The skip-passes-when-no-buildResult posture is the right default for a feature that lands before any scenario actually carries a build-result.json fixture.\n\nThe PR-gate workflow is sensibly path-filtered to wizard agent surface + skills + evals, with concurrency cancelling stale runs and Ring 1 matrix in parallel. Bugbot's low-severity self-trigger filename note (evals-pr.yml vs evals-pr-scenarios.yml at line 23 of the workflow) is worth fixing in a follow-up so changes to this workflow actually run it.\n\nNode 24 unit test failure (DataIngestionCheckScreen tracking-plan MCP polling) is a pre-existing flake unrelated to this PR — same test failed on 575/577 in the same window. The eval scorers + Ring 1 gate are green. Unblocks the scenario-fixture chain (#575 / #576 / #571).
Vanilla React + Vite SPA — the "plain SPA" cohort with no
meta-framework router. Tests VITE_ env prefix and the standard
src/main.tsx entry. Distinct from RR7-data: no router config, init
lands directly in main.tsx alongside createRoot.
Pristine: minimal create-vite-style React+TS starter (package.json,
tsconfig, vite.config, index.html, src/main.tsx, src/App.tsx).
Golden adds @amplitude/unified, .env.local with VITE_AMPLITUDE_API_KEY,
init() in main.tsx, and a track('Sign Up Submitted') call in App.tsx.
Verified: pnpm evals:run react-vite/vanilla → 50/50, contractOk,
hardFailed=false. All Layer 0 + Layer 1 scorers pass.
Stacked on PR #570 (ba-104-layer3-and-ci); will rebase down the
chain once #560 → #567 → #570 land.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
d06d185 to
624a524
Compare
Phase 5 of Week 2 — proof that the eval suite actually catches the
regressions it was built to catch. Each test forks the existing
nextjs-app-router/vanilla golden into a tmpdir, applies a deliberate
regression, and asserts the full scorer stack reports the failure
shape we expect.
Three controls land here:
1. Wrong SDK family — substitute @amplitude/analytics-browser for
@amplitude/unified in the golden's package.json. L0-correct-sdk-package
hard-fails with a detail naming the wrong family.
2. Hardcoded API key — plant the eval-only API key literal in the
entry file. L0-no-hardcoded-key hard-fails.
3. Confirmed event without a track() call — strip a track('Sign In')
while keeping it in the proposed plan. L1-confirmed-events-tracked
fails with a detail naming 'Sign In'.
These run cheaply (sub-second per test) so they live alongside the
success-path runner.test.ts. If a future scorer change makes a
negative control unexpectedly pass, the suite caught a regression in
itself — that's the point.
Verified: 34/34 tests across 6 files; lint clean.
Stacked on PR #570 (ba-104-layer3-and-ci); will rebase down the
chain.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 4 of Week 2 — Layer 3 grades whether the integration still
compiles after the wizard runs, and a workflow gates PRs on Layers
0-3 with a Ring 1 matrix.
* runner/build.ts — runBuild() spawns `pnpm install` then the
scenario.buildCommand inside the working tree, captures stderr,
redacts via the same observability/redact helper used for stdout.
Throws only on timeouts/spawn errors; non-zero exit codes pass
through to the BuildResult so the scorer can grade.
* scorers/layer3-build/build-passes.ts — heavy 10pts. Skip-passes
with weight 0 when no buildResult is on the artifact (current
state on the existing golden); passes with weight 10 when the
build exits 0; fails with the redacted stderr tail otherwise.
Distinguishes install failure (likely lockfile drift) from
build failure (likely a wizard regression) in the detail.
* runner/types.ts — new BuildResult type; Artifact gains optional
buildResult field.
* runner/invoke-wizard.ts — runReplay loads optional
golden/build-result.json so fixtures can pin a recorded outcome.
runLive wiring is deferred — live runs require the eval-only
Amplitude project (open decision #2) and pnpm install costs are
not worth incurring on every dev-checkout run today.
* runner/score.ts — orchestrator's downstream loop already covered
Layer 3 via the layer > 0 filter; no change to the loop, just
the scorer registration.
* .github/workflows/evals-pr.yml — pull_request workflow,
paths-filtered to the wizard agent surface + skills + evals,
runs vitest on evals/ and a Ring 1 scenario matrix in parallel.
Concurrency cancels stale runs. Artifacts uploaded for triage.
Actions pinned to commit SHAs per the repo's other workflows.
Verified:
* pnpm exec vitest run evals/ — 31/31 across 5 files
* pnpm evals:run nextjs-app-router/vanilla — 50/50, contractOk
* pnpm lint clean
Stacked on PR #567 (ba-104-runner-infra); will rebase onto main
after #567 lands.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
624a524 to
6ee0853
Compare
Phase 5 of Week 2 — proof that the eval suite actually catches the
regressions it was built to catch. Each test forks the existing
nextjs-app-router/vanilla golden into a tmpdir, applies a deliberate
regression, and asserts the full scorer stack reports the failure
shape we expect.
Three controls land here:
1. Wrong SDK family — substitute @amplitude/analytics-browser for
@amplitude/unified in the golden's package.json. L0-correct-sdk-package
hard-fails with a detail naming the wrong family.
2. Hardcoded API key — plant the eval-only API key literal in the
entry file. L0-no-hardcoded-key hard-fails.
3. Confirmed event without a track() call — strip a track('Sign In')
while keeping it in the proposed plan. L1-confirmed-events-tracked
fails with a detail naming 'Sign In'.
These run cheaply (sub-second per test) so they live alongside the
success-path runner.test.ts. If a future scorer change makes a
negative control unexpectedly pass, the suite caught a regression in
itself — that's the point.
Verified: 34/34 tests across 6 files; lint clean.
Stacked on PR #570 (ba-104-layer3-and-ci); will rebase down the
chain.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…#577) Phase 5 of Week 2 — proof that the eval suite actually catches the regressions it was built to catch. Each test forks the existing nextjs-app-router/vanilla golden into a tmpdir, applies a deliberate regression, and asserts the full scorer stack reports the failure shape we expect. Three controls land here: 1. Wrong SDK family — substitute @amplitude/analytics-browser for @amplitude/unified in the golden's package.json. L0-correct-sdk-package hard-fails with a detail naming the wrong family. 2. Hardcoded API key — plant the eval-only API key literal in the entry file. L0-no-hardcoded-key hard-fails. 3. Confirmed event without a track() call — strip a track('Sign In') while keeping it in the proposed plan. L1-confirmed-events-tracked fails with a detail naming 'Sign In'. These run cheaply (sub-second per test) so they live alongside the success-path runner.test.ts. If a future scorer change makes a negative control unexpectedly pass, the suite caught a regression in itself — that's the point. Verified: 34/34 tests across 6 files; lint clean. Stacked on PR #570 (ba-104-layer3-and-ci); will rebase down the chain. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>



Summary
Phase 4 of Week 2 — Layer 3 grades whether the integration still compiles after the wizard runs, plus a workflow that gates PRs on Layers 0–3 with a Ring 1 matrix.
runner/build.ts—runBuild()doespnpm install+scenario.buildCommand, captures stderr, redacts viaobservability/redact. Throws only on timeouts/spawn errors; non-zero exit codes pass through toBuildResultso the scorer grades.scorers/layer3-build/build-passes.ts— heavy 10pts. Skip-passes (weight 0) when nobuildResultis on the artifact; passes whenexitCode === 0; fails with the redacted stderr tail otherwise. Distinguishes install failure from build failure in the detail.runner/types.ts— newBuildResulttype;Artifact.buildResult?field.runner/invoke-wizard.ts—runReplayloads optionalgolden/build-result.json.runLivewiring deferred (live runs need the eval-only Amplitude project + would incurpnpm installon every dev-checkout run)..github/workflows/evals-pr.yml—pull_requestworkflow, paths-filtered to the wizard agent surface + skills + evals. Matrix runs Ring 1 scenarios in parallel, vitest covers scorer/runner tests. Concurrency cancels stale runs. Actions pinned to commit SHAs per repo convention.Test plan
pnpm exec vitest run evals/→ 31/31 across 5 filespnpm evals:run nextjs-app-router/vanilla→ 50/50, contractOk (Layer 3 skip-passes since the existing golden has nobuild-result.json)pnpm lintcleangolden/build-result.json, the L3 scorer should immediately participate. Verify on the next scenario fixture PR.Stacked on PR #567 (
ba-104-runner-infra). Will rebase onto main after #567 lands.🤖 Generated with Claude Code
Note
Medium Risk
Adds a new scoring layer and CI gating workflow; while scoped to eval infrastructure, it changes pass/fail signals for PRs and introduces new runner behavior (build/install execution + artifact schema changes).
Overview
Adds Layer 3 build/typecheck coverage to the eval suite by introducing a
runBuildrunner that executespnpm install+ the scenariobuildCommand, captures/redacts stderr, and records aBuildResulton the scenarioArtifact.Extends golden replays to optionally load
golden/build-result.json, wires the newL3-build-passesscorer into the scorer stack, and adds tests (including negative controls) to assert the suite fails on known regressions.Introduces a new PR-gate GitHub Actions workflow (
evals-pr-scenarios.yml) that path-filters relevant changes and runs vitest plus a Ring 1 scenario matrix, uploading eval reports as artifacts.Reviewed by Cursor Bugbot for commit 1915919. Bugbot is set up for automated code reviews on this repo. Configure here.