feat(evals): Layer 3 build runner + PR-gate CI workflow by kaiapeacock-eng · Pull Request #570 · amplitude/wizard

kaiapeacock-eng · 2026-05-06T20:16:46Z

Summary

Phase 4 of Week 2 — Layer 3 grades whether the integration still compiles after the wizard runs, plus a workflow that gates PRs on Layers 0–3 with a Ring 1 matrix.

runner/build.ts — runBuild() does pnpm install + scenario.buildCommand, captures stderr, redacts via observability/redact. Throws only on timeouts/spawn errors; non-zero exit codes pass through to BuildResult so the scorer grades.
scorers/layer3-build/build-passes.ts — heavy 10pts. Skip-passes (weight 0) when no buildResult is on the artifact; passes when exitCode === 0; fails with the redacted stderr tail otherwise. Distinguishes install failure from build failure in the detail.
runner/types.ts — new BuildResult type; Artifact.buildResult? field.
runner/invoke-wizard.ts — runReplay loads optional golden/build-result.json. runLive wiring deferred (live runs need the eval-only Amplitude project + would incur pnpm install on every dev-checkout run).
.github/workflows/evals-pr.yml — pull_request workflow, paths-filtered to the wizard agent surface + skills + evals. Matrix runs Ring 1 scenarios in parallel, vitest covers scorer/runner tests. Concurrency cancels stale runs. Actions pinned to commit SHAs per repo convention.

Test plan

pnpm exec vitest run evals/ → 31/31 across 5 files
pnpm evals:run nextjs-app-router/vanilla → 50/50, contractOk (Layer 3 skip-passes since the existing golden has no build-result.json)
pnpm lint clean
Reviewer: when this lands and a scenario brings a golden/build-result.json, the L3 scorer should immediately participate. Verify on the next scenario fixture PR.

Stacked on PR #567 (ba-104-runner-infra). Will rebase onto main after #567 lands.

🤖 Generated with Claude Code

Note

Medium Risk
Adds a new scoring layer and CI gating workflow; while scoped to eval infrastructure, it changes pass/fail signals for PRs and introduces new runner behavior (build/install execution + artifact schema changes).

Overview
Adds Layer 3 build/typecheck coverage to the eval suite by introducing a runBuild runner that executes pnpm install + the scenario buildCommand, captures/redacts stderr, and records a BuildResult on the scenario Artifact.

Extends golden replays to optionally load golden/build-result.json, wires the new L3-build-passes scorer into the scorer stack, and adds tests (including negative controls) to assert the suite fails on known regressions.

Introduces a new PR-gate GitHub Actions workflow (evals-pr-scenarios.yml) that path-filters relevant changes and runs vitest plus a Ring 1 scenario matrix, uploading eval reports as artifacts.

^{Reviewed by Cursor Bugbot for commit 1915919. Bugbot is set up for automated code reviews on this repo. Configure here.}

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

✅ Fixed: Live run deletes working dir before scorers read it
- Removed the premature rmSync() call in runLive() that deleted workingDir before scorers could read files from it, allowing all file-content-based scorers to function correctly during live eval runs.

Or push these changes by commenting:

@cursor push cb9a21180b

Preview (cb9a21180b)

diff --git a/evals/runner/invoke-wizard.ts b/evals/runner/invoke-wizard.ts
--- a/evals/runner/invoke-wizard.ts
+++ b/evals/runner/invoke-wizard.ts
@@ -246,14 +246,11 @@
   const parsed = parseStream(stdout);
   const fsSnapshot = captureFsSnapshot(pristineDir, workingDir);
 
-  // Best-effort teardown. If this fails the next run will still
-  // start by removing `workingDir` (mintWorkingDir is keyed on runId,
-  // which is fresh per call, so a leak only costs tmpdir space).
-  try {
-    rmSync(workingDir, { recursive: true });
-  } catch {
-    // ignore — diagnostics will surface a stale dir
-  }
+  // Note: workingDir is NOT deleted here because scorers need to read
+  // files from it. The directory is under os.tmpdir() and keyed by
+  // runId (fresh per call), so multiple runs won't collide. The caller
+  // (run-eval.ts) can clean it up after scoring, or let the OS
+  // eventually reclaim tmpdir space.
 
   return {
     artifact: {

_{You can send follow-ups to the cloud agent here.}

kaiapeacock-eng · 2026-05-08T16:16:50Z

Rebased onto main (force-push). The branch had drifted significantly behind — main got #601's call-site eval framework which conflicted on .github/workflows/evals-pr.yml.

Approach: cherry-picked onto pr-567's already-rebased tip + my one new commit, since #570's older commits are already on main via prior merges (cherry-picking them would have produced empty patches).

Resolved one conflict on .github/workflows/evals-pr.yml — main has a call-site gate workflow under that filename. Renamed this PR's scenario gate to .github/workflows/evals-pr-scenarios.yml so both gates coexist with non-clashing concurrency groups (evals-pr- vs evals-pr-scenarios-).

Stack now (6 commits ahead of main, all yours + kelsonpw + me):

56b04e2c — useDetection toggle (was 5a5b41d)
6fd8c77a — per-run working dirs (was 3340ae8)
12a3b361 — WIZARD_BIN (was 4c50981)
ef0b9ad7 — defer runLive cleanup, kelsonpw (was b8e112b)
3a287f73 — init-call overlap + criterion=0 (was 9f2fc3f)
411d2d33 — Layer 3 build runner + scenario PR gate (was c360d99)

Verification: pnpm exec tsc --noEmit clean, eval tests 68/68 passing.

Note: this PR now contains the same first 5 commits as #567. Once #567 merges, those will collapse out automatically on the next rebase.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix prepared a fix for the issue found in the latest run.

✅ Fixed: Workflow path filter references wrong filename for self-trigger
- Changed the path filter from 'evals-pr.yml' (the separate call-site gate workflow) to 'evals-pr-scenarios.yml' (the correct self-reference) so changes to this workflow file trigger CI validation.

Or push these changes by commenting:

@cursor push 584917877b

Preview (584917877b)

diff --git a/.github/workflows/evals-pr-scenarios.yml b/.github/workflows/evals-pr-scenarios.yml
--- a/.github/workflows/evals-pr-scenarios.yml
+++ b/.github/workflows/evals-pr-scenarios.yml
@@ -20,7 +20,7 @@
       - 'skills/integration/**'
       - 'evals/**'
       - 'package.json'
-      - '.github/workflows/evals-pr.yml'
+      - '.github/workflows/evals-pr-scenarios.yml'
 
 # Run-once-per-ref semantics: a new push to the PR cancels the prior
 # run. PR-gate evals are deterministic against golden replays today,

_{You can send follow-ups to the cloud agent here.}

^{Reviewed by Cursor Bugbot for commit 411d2d3. Configure here.}

cursor · 2026-05-08T16:24:48Z

+      - 'skills/integration/**'
+      - 'evals/**'
+      - 'package.json'
+      - '.github/workflows/evals-pr.yml'


Workflow path filter references wrong filename for self-trigger

Low Severity

The workflow file evals-pr-scenarios.yml includes '.github/workflows/evals-pr.yml' in its path triggers, but that's a separate pre-existing workflow (the call-site gate). The intended self-reference to enable CI validation of changes to this workflow should be '.github/workflows/evals-pr-scenarios.yml'. The PR description even refers to the new file as evals-pr.yml, confirming the developer thought that was the filename — the rename to avoid collision with the existing file wasn't reflected in this path entry.

^{Reviewed by Cursor Bugbot for commit 411d2d3. Configure here.}

Vanilla React + Vite SPA — the "plain SPA" cohort with no meta-framework router. Tests VITE_ env prefix and the standard src/main.tsx entry. Distinct from RR7-data: no router config, init lands directly in main.tsx alongside createRoot. Pristine: minimal create-vite-style React+TS starter (package.json, tsconfig, vite.config, index.html, src/main.tsx, src/App.tsx). Golden adds @amplitude/unified, .env.local with VITE_AMPLITUDE_API_KEY, init() in main.tsx, and a track('Sign Up Submitted') call in App.tsx. Verified: pnpm evals:run react-vite/vanilla → 50/50, contractOk, hardFailed=false. All Layer 0 + Layer 1 scorers pass. Stacked on PR #570 (ba-104-layer3-and-ci); will rebase down the chain once #560 → #567 → #570 land. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Phase 5 of Week 2 — proof that the eval suite actually catches the regressions it was built to catch. Each test forks the existing nextjs-app-router/vanilla golden into a tmpdir, applies a deliberate regression, and asserts the full scorer stack reports the failure shape we expect. Three controls land here: 1. Wrong SDK family — substitute @amplitude/analytics-browser for @amplitude/unified in the golden's package.json. L0-correct-sdk-package hard-fails with a detail naming the wrong family. 2. Hardcoded API key — plant the eval-only API key literal in the entry file. L0-no-hardcoded-key hard-fails. 3. Confirmed event without a track() call — strip a track('Sign In') while keeping it in the proposed plan. L1-confirmed-events-tracked fails with a detail naming 'Sign In'. These run cheaply (sub-second per test) so they live alongside the success-path runner.test.ts. If a future scorer change makes a negative control unexpectedly pass, the suite caught a regression in itself — that's the point. Verified: 34/34 tests across 6 files; lint clean. Stacked on PR #570 (ba-104-layer3-and-ci); will rebase down the chain. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

kelsonpw

LGTM. Layer 3 build runner is well scoped — pnpm install + scenario buildCommand, redacted stderr tail, separate exitCodes for install vs build (criterion detail distinguishes lockfile drift from a wizard regression). The skip-passes-when-no-buildResult posture is the right default for a feature that lands before any scenario actually carries a build-result.json fixture.\n\nThe PR-gate workflow is sensibly path-filtered to wizard agent surface + skills + evals, with concurrency cancelling stale runs and Ring 1 matrix in parallel. Bugbot's low-severity self-trigger filename note (evals-pr.yml vs evals-pr-scenarios.yml at line 23 of the workflow) is worth fixing in a follow-up so changes to this workflow actually run it.\n\nNode 24 unit test failure (DataIngestionCheckScreen tracking-plan MCP polling) is a pre-existing flake unrelated to this PR — same test failed on 575/577 in the same window. The eval scorers + Ring 1 gate are green. Unblocks the scenario-fixture chain (#575 / #576 / #571).

Vanilla React + Vite SPA — the "plain SPA" cohort with no meta-framework router. Tests VITE_ env prefix and the standard src/main.tsx entry. Distinct from RR7-data: no router config, init lands directly in main.tsx alongside createRoot. Pristine: minimal create-vite-style React+TS starter (package.json, tsconfig, vite.config, index.html, src/main.tsx, src/App.tsx). Golden adds @amplitude/unified, .env.local with VITE_AMPLITUDE_API_KEY, init() in main.tsx, and a track('Sign Up Submitted') call in App.tsx. Verified: pnpm evals:run react-vite/vanilla → 50/50, contractOk, hardFailed=false. All Layer 0 + Layer 1 scorers pass. Stacked on PR #570 (ba-104-layer3-and-ci); will rebase down the chain once #560 → #567 → #570 land. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Phase 5 of Week 2 — proof that the eval suite actually catches the regressions it was built to catch. Each test forks the existing nextjs-app-router/vanilla golden into a tmpdir, applies a deliberate regression, and asserts the full scorer stack reports the failure shape we expect. Three controls land here: 1. Wrong SDK family — substitute @amplitude/analytics-browser for @amplitude/unified in the golden's package.json. L0-correct-sdk-package hard-fails with a detail naming the wrong family. 2. Hardcoded API key — plant the eval-only API key literal in the entry file. L0-no-hardcoded-key hard-fails. 3. Confirmed event without a track() call — strip a track('Sign In') while keeping it in the proposed plan. L1-confirmed-events-tracked fails with a detail naming 'Sign In'. These run cheaply (sub-second per test) so they live alongside the success-path runner.test.ts. If a future scorer change makes a negative control unexpectedly pass, the suite caught a regression in itself — that's the point. Verified: 34/34 tests across 6 files; lint clean. Stacked on PR #570 (ba-104-layer3-and-ci); will rebase down the chain. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Phase 4 of Week 2 — Layer 3 grades whether the integration still compiles after the wizard runs, and a workflow gates PRs on Layers 0-3 with a Ring 1 matrix. * runner/build.ts — runBuild() spawns `pnpm install` then the scenario.buildCommand inside the working tree, captures stderr, redacts via the same observability/redact helper used for stdout. Throws only on timeouts/spawn errors; non-zero exit codes pass through to the BuildResult so the scorer can grade. * scorers/layer3-build/build-passes.ts — heavy 10pts. Skip-passes with weight 0 when no buildResult is on the artifact (current state on the existing golden); passes with weight 10 when the build exits 0; fails with the redacted stderr tail otherwise. Distinguishes install failure (likely lockfile drift) from build failure (likely a wizard regression) in the detail. * runner/types.ts — new BuildResult type; Artifact gains optional buildResult field. * runner/invoke-wizard.ts — runReplay loads optional golden/build-result.json so fixtures can pin a recorded outcome. runLive wiring is deferred — live runs require the eval-only Amplitude project (open decision #2) and pnpm install costs are not worth incurring on every dev-checkout run today. * runner/score.ts — orchestrator's downstream loop already covered Layer 3 via the layer > 0 filter; no change to the loop, just the scorer registration. * .github/workflows/evals-pr.yml — pull_request workflow, paths-filtered to the wizard agent surface + skills + evals, runs vitest on evals/ and a Ring 1 scenario matrix in parallel. Concurrency cancels stale runs. Artifacts uploaded for triage. Actions pinned to commit SHAs per the repo's other workflows. Verified: * pnpm exec vitest run evals/ — 31/31 across 5 files * pnpm evals:run nextjs-app-router/vanilla — 50/50, contractOk * pnpm lint clean Stacked on PR #567 (ba-104-runner-infra); will rebase onto main after #567 lands. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Phase 5 of Week 2 — proof that the eval suite actually catches the regressions it was built to catch. Each test forks the existing nextjs-app-router/vanilla golden into a tmpdir, applies a deliberate regression, and asserts the full scorer stack reports the failure shape we expect. Three controls land here: 1. Wrong SDK family — substitute @amplitude/analytics-browser for @amplitude/unified in the golden's package.json. L0-correct-sdk-package hard-fails with a detail naming the wrong family. 2. Hardcoded API key — plant the eval-only API key literal in the entry file. L0-no-hardcoded-key hard-fails. 3. Confirmed event without a track() call — strip a track('Sign In') while keeping it in the proposed plan. L1-confirmed-events-tracked fails with a detail naming 'Sign In'. These run cheaply (sub-second per test) so they live alongside the success-path runner.test.ts. If a future scorer change makes a negative control unexpectedly pass, the suite caught a regression in itself — that's the point. Verified: 34/34 tests across 6 files; lint clean. Stacked on PR #570 (ba-104-layer3-and-ci); will rebase down the chain. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…#577) Phase 5 of Week 2 — proof that the eval suite actually catches the regressions it was built to catch. Each test forks the existing nextjs-app-router/vanilla golden into a tmpdir, applies a deliberate regression, and asserts the full scorer stack reports the failure shape we expect. Three controls land here: 1. Wrong SDK family — substitute @amplitude/analytics-browser for @amplitude/unified in the golden's package.json. L0-correct-sdk-package hard-fails with a detail naming the wrong family. 2. Hardcoded API key — plant the eval-only API key literal in the entry file. L0-no-hardcoded-key hard-fails. 3. Confirmed event without a track() call — strip a track('Sign In') while keeping it in the proposed plan. L1-confirmed-events-tracked fails with a detail naming 'Sign In'. These run cheaply (sub-second per test) so they live alongside the success-path runner.test.ts. If a future scorer change makes a negative control unexpectedly pass, the suite caught a regression in itself — that's the point. Verified: 34/34 tests across 6 files; lint clean. Stacked on PR #570 (ba-104-layer3-and-ci); will rebase down the chain. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

kaiapeacock-eng requested a review from a team as a code owner May 6, 2026 20:16

cursor Bot reviewed May 6, 2026

View reviewed changes

Comment thread evals/runner/invoke-wizard.ts Outdated

This was referenced May 6, 2026

feat(evals): Ring 1 scenario — react-router-7/data #575

Closed

feat(evals): Ring 1 scenario — react-vite/vanilla #576

Closed

test(evals): negative-control regressions catch the cases they should #577

Merged

kaiapeacock-eng force-pushed the ba-104-layer3-and-ci branch from c360d99 to 411d2d3 Compare May 8, 2026 16:16

cursor Bot reviewed May 8, 2026

View reviewed changes

kaiapeacock-eng force-pushed the ba-104-layer3-and-ci branch from 411d2d3 to d06d185 Compare May 8, 2026 16:26

kelsonpw approved these changes May 8, 2026

View reviewed changes

kelsonpw mentioned this pull request May 8, 2026

feat(evals): Layer 2 static scorers (criteria 2, 3, 7, 11, 12, 15) #566

Merged

5 tasks

kaiapeacock-eng force-pushed the ba-104-layer3-and-ci branch from d06d185 to 624a524 Compare May 8, 2026 17:36

kaiapeacock-eng force-pushed the ba-104-layer3-and-ci branch from 624a524 to 6ee0853 Compare May 8, 2026 18:33

kaiapeacock-eng enabled auto-merge (squash) May 8, 2026 18:33

kaiapeacock-eng merged commit 3b0eefd into main May 8, 2026
13 checks passed

amplitude-release-bot Bot mentioned this pull request May 8, 2026

chore(main): release wizard 1.17.0 #612

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(evals): Layer 3 build runner + PR-gate CI workflow#570

feat(evals): Layer 3 build runner + PR-gate CI workflow#570
kaiapeacock-eng merged 2 commits into
mainfrom
ba-104-layer3-and-ci

kaiapeacock-eng commented May 6, 2026 •

edited by cursor Bot

Loading

Uh oh!

cursor Bot left a comment •

edited

Loading

Uh oh!

Uh oh!

kaiapeacock-eng commented May 8, 2026

Uh oh!

cursor Bot left a comment •

edited

Loading

Uh oh!

cursor Bot May 8, 2026

Uh oh!

kelsonpw left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

kaiapeacock-eng commented May 6, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

cursor Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

kaiapeacock-eng commented May 8, 2026

Uh oh!

cursor Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cursor Bot May 8, 2026

Choose a reason for hiding this comment

Workflow path filter references wrong filename for self-trigger

Uh oh!

kelsonpw left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

kaiapeacock-eng commented May 6, 2026 •

edited by cursor Bot

Loading

cursor Bot left a comment •

edited

Loading

cursor Bot left a comment •

edited

Loading