Skip to content

feat(evals): Layer 3 build runner + PR-gate CI workflow#570

Merged
kaiapeacock-eng merged 2 commits into
mainfrom
ba-104-layer3-and-ci
May 8, 2026
Merged

feat(evals): Layer 3 build runner + PR-gate CI workflow#570
kaiapeacock-eng merged 2 commits into
mainfrom
ba-104-layer3-and-ci

Conversation

@kaiapeacock-eng

@kaiapeacock-eng kaiapeacock-eng commented May 6, 2026

Copy link
Copy Markdown
Collaborator

Summary

Phase 4 of Week 2 — Layer 3 grades whether the integration still compiles after the wizard runs, plus a workflow that gates PRs on Layers 0–3 with a Ring 1 matrix.

  • runner/build.tsrunBuild() does pnpm install + scenario.buildCommand, captures stderr, redacts via observability/redact. Throws only on timeouts/spawn errors; non-zero exit codes pass through to BuildResult so the scorer grades.
  • scorers/layer3-build/build-passes.ts — heavy 10pts. Skip-passes (weight 0) when no buildResult is on the artifact; passes when exitCode === 0; fails with the redacted stderr tail otherwise. Distinguishes install failure from build failure in the detail.
  • runner/types.ts — new BuildResult type; Artifact.buildResult? field.
  • runner/invoke-wizard.tsrunReplay loads optional golden/build-result.json. runLive wiring deferred (live runs need the eval-only Amplitude project + would incur pnpm install on every dev-checkout run).
  • .github/workflows/evals-pr.ymlpull_request workflow, paths-filtered to the wizard agent surface + skills + evals. Matrix runs Ring 1 scenarios in parallel, vitest covers scorer/runner tests. Concurrency cancels stale runs. Actions pinned to commit SHAs per repo convention.

Test plan

  • pnpm exec vitest run evals/ → 31/31 across 5 files
  • pnpm evals:run nextjs-app-router/vanilla → 50/50, contractOk (Layer 3 skip-passes since the existing golden has no build-result.json)
  • pnpm lint clean
  • Reviewer: when this lands and a scenario brings a golden/build-result.json, the L3 scorer should immediately participate. Verify on the next scenario fixture PR.

Stacked on PR #567 (ba-104-runner-infra). Will rebase onto main after #567 lands.

🤖 Generated with Claude Code


Note

Medium Risk
Adds a new scoring layer and CI gating workflow; while scoped to eval infrastructure, it changes pass/fail signals for PRs and introduces new runner behavior (build/install execution + artifact schema changes).

Overview
Adds Layer 3 build/typecheck coverage to the eval suite by introducing a runBuild runner that executes pnpm install + the scenario buildCommand, captures/redacts stderr, and records a BuildResult on the scenario Artifact.

Extends golden replays to optionally load golden/build-result.json, wires the new L3-build-passes scorer into the scorer stack, and adds tests (including negative controls) to assert the suite fails on known regressions.

Introduces a new PR-gate GitHub Actions workflow (evals-pr-scenarios.yml) that path-filters relevant changes and runs vitest plus a Ring 1 scenario matrix, uploading eval reports as artifacts.

Reviewed by Cursor Bugbot for commit 1915919. Bugbot is set up for automated code reviews on this repo. Configure here.

@kaiapeacock-eng kaiapeacock-eng requested a review from a team as a code owner May 6, 2026 20:16

@cursor cursor Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Live run deletes working dir before scorers read it
    • Removed the premature rmSync() call in runLive() that deleted workingDir before scorers could read files from it, allowing all file-content-based scorers to function correctly during live eval runs.

Create PR

Or push these changes by commenting:

@cursor push cb9a21180b
Preview (cb9a21180b)
diff --git a/evals/runner/invoke-wizard.ts b/evals/runner/invoke-wizard.ts
--- a/evals/runner/invoke-wizard.ts
+++ b/evals/runner/invoke-wizard.ts
@@ -246,14 +246,11 @@
   const parsed = parseStream(stdout);
   const fsSnapshot = captureFsSnapshot(pristineDir, workingDir);
 
-  // Best-effort teardown. If this fails the next run will still
-  // start by removing `workingDir` (mintWorkingDir is keyed on runId,
-  // which is fresh per call, so a leak only costs tmpdir space).
-  try {
-    rmSync(workingDir, { recursive: true });
-  } catch {
-    // ignore — diagnostics will surface a stale dir
-  }
+  // Note: workingDir is NOT deleted here because scorers need to read
+  // files from it. The directory is under os.tmpdir() and keyed by
+  // runId (fresh per call), so multiple runs won't collide. The caller
+  // (run-eval.ts) can clean it up after scoring, or let the OS
+  // eventually reclaim tmpdir space.
 
   return {
     artifact: {

You can send follow-ups to the cloud agent here.

Comment thread evals/runner/invoke-wizard.ts Outdated
@kaiapeacock-eng

Copy link
Copy Markdown
Collaborator Author

Rebased onto main (force-push). The branch had drifted significantly behind — main got #601's call-site eval framework which conflicted on .github/workflows/evals-pr.yml.

Approach: cherry-picked onto pr-567's already-rebased tip + my one new commit, since #570's older commits are already on main via prior merges (cherry-picking them would have produced empty patches).

Resolved one conflict on .github/workflows/evals-pr.yml — main has a call-site gate workflow under that filename. Renamed this PR's scenario gate to .github/workflows/evals-pr-scenarios.yml so both gates coexist with non-clashing concurrency groups (evals-pr- vs evals-pr-scenarios-).

Stack now (6 commits ahead of main, all yours + kelsonpw + me):

  • 56b04e2c — useDetection toggle (was 5a5b41d)
  • 6fd8c77a — per-run working dirs (was 3340ae8)
  • 12a3b361 — WIZARD_BIN (was 4c50981)
  • ef0b9ad7 — defer runLive cleanup, kelsonpw (was b8e112b)
  • 3a287f73 — init-call overlap + criterion=0 (was 9f2fc3f)
  • 411d2d33 — Layer 3 build runner + scenario PR gate (was c360d99)

Verification: pnpm exec tsc --noEmit clean, eval tests 68/68 passing.

Note: this PR now contains the same first 5 commits as #567. Once #567 merges, those will collapse out automatically on the next rebase.

@cursor cursor Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Workflow path filter references wrong filename for self-trigger
    • Changed the path filter from 'evals-pr.yml' (the separate call-site gate workflow) to 'evals-pr-scenarios.yml' (the correct self-reference) so changes to this workflow file trigger CI validation.

Create PR

Or push these changes by commenting:

@cursor push 584917877b
Preview (584917877b)
diff --git a/.github/workflows/evals-pr-scenarios.yml b/.github/workflows/evals-pr-scenarios.yml
--- a/.github/workflows/evals-pr-scenarios.yml
+++ b/.github/workflows/evals-pr-scenarios.yml
@@ -20,7 +20,7 @@
       - 'skills/integration/**'
       - 'evals/**'
       - 'package.json'
-      - '.github/workflows/evals-pr.yml'
+      - '.github/workflows/evals-pr-scenarios.yml'
 
 # Run-once-per-ref semantics: a new push to the PR cancels the prior
 # run. PR-gate evals are deterministic against golden replays today,

You can send follow-ups to the cloud agent here.

Reviewed by Cursor Bugbot for commit 411d2d3. Configure here.

- 'skills/integration/**'
- 'evals/**'
- 'package.json'
- '.github/workflows/evals-pr.yml'

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Workflow path filter references wrong filename for self-trigger

Low Severity

The workflow file evals-pr-scenarios.yml includes '.github/workflows/evals-pr.yml' in its path triggers, but that's a separate pre-existing workflow (the call-site gate). The intended self-reference to enable CI validation of changes to this workflow should be '.github/workflows/evals-pr-scenarios.yml'. The PR description even refers to the new file as evals-pr.yml, confirming the developer thought that was the filename — the rename to avoid collision with the existing file wasn't reflected in this path entry.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 411d2d3. Configure here.

@kaiapeacock-eng kaiapeacock-eng force-pushed the ba-104-layer3-and-ci branch from 411d2d3 to d06d185 Compare May 8, 2026 16:26
kaiapeacock-eng added a commit that referenced this pull request May 8, 2026
Vanilla React + Vite SPA — the "plain SPA" cohort with no
meta-framework router. Tests VITE_ env prefix and the standard
src/main.tsx entry. Distinct from RR7-data: no router config, init
lands directly in main.tsx alongside createRoot.

Pristine: minimal create-vite-style React+TS starter (package.json,
tsconfig, vite.config, index.html, src/main.tsx, src/App.tsx).
Golden adds @amplitude/unified, .env.local with VITE_AMPLITUDE_API_KEY,
init() in main.tsx, and a track('Sign Up Submitted') call in App.tsx.

Verified: pnpm evals:run react-vite/vanilla → 50/50, contractOk,
hardFailed=false. All Layer 0 + Layer 1 scorers pass.

Stacked on PR #570 (ba-104-layer3-and-ci); will rebase down the
chain once #560#567#570 land.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kaiapeacock-eng added a commit that referenced this pull request May 8, 2026
Phase 5 of Week 2 — proof that the eval suite actually catches the
regressions it was built to catch. Each test forks the existing
nextjs-app-router/vanilla golden into a tmpdir, applies a deliberate
regression, and asserts the full scorer stack reports the failure
shape we expect.

Three controls land here:

  1. Wrong SDK family — substitute @amplitude/analytics-browser for
     @amplitude/unified in the golden's package.json. L0-correct-sdk-package
     hard-fails with a detail naming the wrong family.
  2. Hardcoded API key — plant the eval-only API key literal in the
     entry file. L0-no-hardcoded-key hard-fails.
  3. Confirmed event without a track() call — strip a track('Sign In')
     while keeping it in the proposed plan. L1-confirmed-events-tracked
     fails with a detail naming 'Sign In'.

These run cheaply (sub-second per test) so they live alongside the
success-path runner.test.ts. If a future scorer change makes a
negative control unexpectedly pass, the suite caught a regression in
itself — that's the point.

Verified: 34/34 tests across 6 files; lint clean.

Stacked on PR #570 (ba-104-layer3-and-ci); will rebase down the
chain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@kelsonpw kelsonpw left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Layer 3 build runner is well scoped — pnpm install + scenario buildCommand, redacted stderr tail, separate exitCodes for install vs build (criterion detail distinguishes lockfile drift from a wizard regression). The skip-passes-when-no-buildResult posture is the right default for a feature that lands before any scenario actually carries a build-result.json fixture.\n\nThe PR-gate workflow is sensibly path-filtered to wizard agent surface + skills + evals, with concurrency cancelling stale runs and Ring 1 matrix in parallel. Bugbot's low-severity self-trigger filename note (evals-pr.yml vs evals-pr-scenarios.yml at line 23 of the workflow) is worth fixing in a follow-up so changes to this workflow actually run it.\n\nNode 24 unit test failure (DataIngestionCheckScreen tracking-plan MCP polling) is a pre-existing flake unrelated to this PR — same test failed on 575/577 in the same window. The eval scorers + Ring 1 gate are green. Unblocks the scenario-fixture chain (#575 / #576 / #571).

kaiapeacock-eng added a commit that referenced this pull request May 8, 2026
Vanilla React + Vite SPA — the "plain SPA" cohort with no
meta-framework router. Tests VITE_ env prefix and the standard
src/main.tsx entry. Distinct from RR7-data: no router config, init
lands directly in main.tsx alongside createRoot.

Pristine: minimal create-vite-style React+TS starter (package.json,
tsconfig, vite.config, index.html, src/main.tsx, src/App.tsx).
Golden adds @amplitude/unified, .env.local with VITE_AMPLITUDE_API_KEY,
init() in main.tsx, and a track('Sign Up Submitted') call in App.tsx.

Verified: pnpm evals:run react-vite/vanilla → 50/50, contractOk,
hardFailed=false. All Layer 0 + Layer 1 scorers pass.

Stacked on PR #570 (ba-104-layer3-and-ci); will rebase down the
chain once #560#567#570 land.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kaiapeacock-eng kaiapeacock-eng force-pushed the ba-104-layer3-and-ci branch from d06d185 to 624a524 Compare May 8, 2026 17:36
kaiapeacock-eng added a commit that referenced this pull request May 8, 2026
Phase 5 of Week 2 — proof that the eval suite actually catches the
regressions it was built to catch. Each test forks the existing
nextjs-app-router/vanilla golden into a tmpdir, applies a deliberate
regression, and asserts the full scorer stack reports the failure
shape we expect.

Three controls land here:

  1. Wrong SDK family — substitute @amplitude/analytics-browser for
     @amplitude/unified in the golden's package.json. L0-correct-sdk-package
     hard-fails with a detail naming the wrong family.
  2. Hardcoded API key — plant the eval-only API key literal in the
     entry file. L0-no-hardcoded-key hard-fails.
  3. Confirmed event without a track() call — strip a track('Sign In')
     while keeping it in the proposed plan. L1-confirmed-events-tracked
     fails with a detail naming 'Sign In'.

These run cheaply (sub-second per test) so they live alongside the
success-path runner.test.ts. If a future scorer change makes a
negative control unexpectedly pass, the suite caught a regression in
itself — that's the point.

Verified: 34/34 tests across 6 files; lint clean.

Stacked on PR #570 (ba-104-layer3-and-ci); will rebase down the
chain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 4 of Week 2 — Layer 3 grades whether the integration still
compiles after the wizard runs, and a workflow gates PRs on Layers
0-3 with a Ring 1 matrix.

  * runner/build.ts — runBuild() spawns `pnpm install` then the
    scenario.buildCommand inside the working tree, captures stderr,
    redacts via the same observability/redact helper used for stdout.
    Throws only on timeouts/spawn errors; non-zero exit codes pass
    through to the BuildResult so the scorer can grade.
  * scorers/layer3-build/build-passes.ts — heavy 10pts. Skip-passes
    with weight 0 when no buildResult is on the artifact (current
    state on the existing golden); passes with weight 10 when the
    build exits 0; fails with the redacted stderr tail otherwise.
    Distinguishes install failure (likely lockfile drift) from
    build failure (likely a wizard regression) in the detail.
  * runner/types.ts — new BuildResult type; Artifact gains optional
    buildResult field.
  * runner/invoke-wizard.ts — runReplay loads optional
    golden/build-result.json so fixtures can pin a recorded outcome.
    runLive wiring is deferred — live runs require the eval-only
    Amplitude project (open decision #2) and pnpm install costs are
    not worth incurring on every dev-checkout run today.
  * runner/score.ts — orchestrator's downstream loop already covered
    Layer 3 via the layer > 0 filter; no change to the loop, just
    the scorer registration.
  * .github/workflows/evals-pr.yml — pull_request workflow,
    paths-filtered to the wizard agent surface + skills + evals,
    runs vitest on evals/ and a Ring 1 scenario matrix in parallel.
    Concurrency cancels stale runs. Artifacts uploaded for triage.
    Actions pinned to commit SHAs per the repo's other workflows.

Verified:
  * pnpm exec vitest run evals/ — 31/31 across 5 files
  * pnpm evals:run nextjs-app-router/vanilla — 50/50, contractOk
  * pnpm lint clean

Stacked on PR #567 (ba-104-runner-infra); will rebase onto main
after #567 lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kaiapeacock-eng kaiapeacock-eng force-pushed the ba-104-layer3-and-ci branch from 624a524 to 6ee0853 Compare May 8, 2026 18:33
@kaiapeacock-eng kaiapeacock-eng enabled auto-merge (squash) May 8, 2026 18:33
kaiapeacock-eng added a commit that referenced this pull request May 8, 2026
Phase 5 of Week 2 — proof that the eval suite actually catches the
regressions it was built to catch. Each test forks the existing
nextjs-app-router/vanilla golden into a tmpdir, applies a deliberate
regression, and asserts the full scorer stack reports the failure
shape we expect.

Three controls land here:

  1. Wrong SDK family — substitute @amplitude/analytics-browser for
     @amplitude/unified in the golden's package.json. L0-correct-sdk-package
     hard-fails with a detail naming the wrong family.
  2. Hardcoded API key — plant the eval-only API key literal in the
     entry file. L0-no-hardcoded-key hard-fails.
  3. Confirmed event without a track() call — strip a track('Sign In')
     while keeping it in the proposed plan. L1-confirmed-events-tracked
     fails with a detail naming 'Sign In'.

These run cheaply (sub-second per test) so they live alongside the
success-path runner.test.ts. If a future scorer change makes a
negative control unexpectedly pass, the suite caught a regression in
itself — that's the point.

Verified: 34/34 tests across 6 files; lint clean.

Stacked on PR #570 (ba-104-layer3-and-ci); will rebase down the
chain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…#577)

Phase 5 of Week 2 — proof that the eval suite actually catches the
regressions it was built to catch. Each test forks the existing
nextjs-app-router/vanilla golden into a tmpdir, applies a deliberate
regression, and asserts the full scorer stack reports the failure
shape we expect.

Three controls land here:

  1. Wrong SDK family — substitute @amplitude/analytics-browser for
     @amplitude/unified in the golden's package.json. L0-correct-sdk-package
     hard-fails with a detail naming the wrong family.
  2. Hardcoded API key — plant the eval-only API key literal in the
     entry file. L0-no-hardcoded-key hard-fails.
  3. Confirmed event without a track() call — strip a track('Sign In')
     while keeping it in the proposed plan. L1-confirmed-events-tracked
     fails with a detail naming 'Sign In'.

These run cheaply (sub-second per test) so they live alongside the
success-path runner.test.ts. If a future scorer change makes a
negative control unexpectedly pass, the suite caught a regression in
itself — that's the point.

Verified: 34/34 tests across 6 files; lint clean.

Stacked on PR #570 (ba-104-layer3-and-ci); will rebase down the
chain.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kaiapeacock-eng kaiapeacock-eng merged commit 3b0eefd into main May 8, 2026
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants