test(evals): negative-control regressions catch the cases they should by kaiapeacock-eng · Pull Request #577 · amplitude/wizard

kaiapeacock-eng · 2026-05-06T22:47:39Z

Summary

Phase 5 of Week 2 — proof that the eval suite actually catches the regressions it was built to catch. Each test forks the existing nextjs-app-router/vanilla golden into a tmpdir, applies a deliberate regression, and asserts the full scorer stack reports the failure shape we expect.

Three controls:

Wrong SDK family — substitute @amplitude/analytics-browser for @amplitude/unified in the golden's package.json. L0-correct-sdk-package hard-fails with a detail naming the wrong family.
Hardcoded API key — plant the eval-only API key literal in the entry file. L0-no-hardcoded-key hard-fails.
Confirmed event without a track() call — strip a track('Sign In') while keeping it in the proposed plan. L1-confirmed-events-tracked fails with a detail naming 'Sign In'.

If a future scorer change makes a negative control unexpectedly pass, the suite caught a regression in itself — that's the point.

Test plan

pnpm exec vitest run evals/ → 34/34 across 6 files
pnpm lint clean
All three negative controls fire as expected

Stacked: base is ba-104-layer3-and-ci (#570). Full chain: this → #570 → #567 → #560 → main.

🤖 Generated with Claude Code

Note

Low Risk
Low risk: adds new Vitest coverage only, exercising the existing replay/score pipeline without changing production scoring logic.

Overview
Adds a new Vitest negative-control suite that clones the nextjs-app-router/vanilla golden scenario into a temp directory, injects intentional regressions, and runs the full runReplay + score stack to assert the expected failure modes.

The tests verify three canary cases: wrong SDK family triggers L0-correct-sdk-package hard-fail, hardcoded API key triggers L0-no-hardcoded-key hard-fail, and missing track() for a confirmed event causes L1-confirmed-events-tracked to fail with the missing event called out.

^{Reviewed by Cursor Bugbot for commit 4aa30e6. Bugbot is set up for automated code reviews on this repo. Configure here.}

kelsonpw

LGTM. Three sharp negative-control tests (wrong SDK family, hardcoded API key, missing track() for a confirmed event) all asserting on the failure shape — name in the detail, hard-fail on the L0 ones, soft-fail with event name on the L1. Self-validating in the right way: if a future scorer change makes one of these unexpectedly pass, the suite caught a regression in itself.\n\nNote: Node 22/24 unit-test failure (DataIngestionCheckScreen tracking-plan polling test) is a known flake unrelated to this PR — same test failed on 570 and 575 in the same window. Eval scorer + Ring 1 gate are green, which is what matters here. Stacked on #570; will merge cleanly once the chain lands.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix prepared a fix for the issue found in the latest run.

✅ Fixed: Workflow paths filter doesn't include itself for self-validation
- Added '.github/workflows/evals-pr-scenarios.yml' to its own paths trigger so changes to the workflow file will trigger CI validation.

Or push these changes by commenting:

@cursor push 002ac15775

Preview (002ac15775)

diff --git a/.github/workflows/evals-pr-scenarios.yml b/.github/workflows/evals-pr-scenarios.yml
--- a/.github/workflows/evals-pr-scenarios.yml
+++ b/.github/workflows/evals-pr-scenarios.yml
@@ -21,6 +21,7 @@
       - 'evals/**'
       - 'package.json'
       - '.github/workflows/evals-pr.yml'
+      - '.github/workflows/evals-pr-scenarios.yml'
 
 # Run-once-per-ref semantics: a new push to the PR cancels the prior
 # run. PR-gate evals are deterministic against golden replays today,

_{You can send follow-ups to the cloud agent here.}

^{Reviewed by Cursor Bugbot for commit 1c178b2. Configure here.}

Phase 5 of Week 2 — proof that the eval suite actually catches the regressions it was built to catch. Each test forks the existing nextjs-app-router/vanilla golden into a tmpdir, applies a deliberate regression, and asserts the full scorer stack reports the failure shape we expect. Three controls land here: 1. Wrong SDK family — substitute @amplitude/analytics-browser for @amplitude/unified in the golden's package.json. L0-correct-sdk-package hard-fails with a detail naming the wrong family. 2. Hardcoded API key — plant the eval-only API key literal in the entry file. L0-no-hardcoded-key hard-fails. 3. Confirmed event without a track() call — strip a track('Sign In') while keeping it in the proposed plan. L1-confirmed-events-tracked fails with a detail naming 'Sign In'. These run cheaply (sub-second per test) so they live alongside the success-path runner.test.ts. If a future scorer change makes a negative control unexpectedly pass, the suite caught a regression in itself — that's the point. Verified: 34/34 tests across 6 files; lint clean. Stacked on PR #570 (ba-104-layer3-and-ci); will rebase down the chain. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(evals): Layer 3 build runner + PR-gate CI workflow Phase 4 of Week 2 — Layer 3 grades whether the integration still compiles after the wizard runs, and a workflow gates PRs on Layers 0-3 with a Ring 1 matrix. * runner/build.ts — runBuild() spawns `pnpm install` then the scenario.buildCommand inside the working tree, captures stderr, redacts via the same observability/redact helper used for stdout. Throws only on timeouts/spawn errors; non-zero exit codes pass through to the BuildResult so the scorer can grade. * scorers/layer3-build/build-passes.ts — heavy 10pts. Skip-passes with weight 0 when no buildResult is on the artifact (current state on the existing golden); passes with weight 10 when the build exits 0; fails with the redacted stderr tail otherwise. Distinguishes install failure (likely lockfile drift) from build failure (likely a wizard regression) in the detail. * runner/types.ts — new BuildResult type; Artifact gains optional buildResult field. * runner/invoke-wizard.ts — runReplay loads optional golden/build-result.json so fixtures can pin a recorded outcome. runLive wiring is deferred — live runs require the eval-only Amplitude project (open decision #2) and pnpm install costs are not worth incurring on every dev-checkout run today. * runner/score.ts — orchestrator's downstream loop already covered Layer 3 via the layer > 0 filter; no change to the loop, just the scorer registration. * .github/workflows/evals-pr.yml — pull_request workflow, paths-filtered to the wizard agent surface + skills + evals, runs vitest on evals/ and a Ring 1 scenario matrix in parallel. Concurrency cancels stale runs. Artifacts uploaded for triage. Actions pinned to commit SHAs per the repo's other workflows. Verified: * pnpm exec vitest run evals/ — 31/31 across 5 files * pnpm evals:run nextjs-app-router/vanilla — 50/50, contractOk * pnpm lint clean Stacked on PR #567 (ba-104-runner-infra); will rebase onto main after #567 lands. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(evals): negative-control regressions catch the cases they should (#577) Phase 5 of Week 2 — proof that the eval suite actually catches the regressions it was built to catch. Each test forks the existing nextjs-app-router/vanilla golden into a tmpdir, applies a deliberate regression, and asserts the full scorer stack reports the failure shape we expect. Three controls land here: 1. Wrong SDK family — substitute @amplitude/analytics-browser for @amplitude/unified in the golden's package.json. L0-correct-sdk-package hard-fails with a detail naming the wrong family. 2. Hardcoded API key — plant the eval-only API key literal in the entry file. L0-no-hardcoded-key hard-fails. 3. Confirmed event without a track() call — strip a track('Sign In') while keeping it in the proposed plan. L1-confirmed-events-tracked fails with a detail naming 'Sign In'. These run cheaply (sub-second per test) so they live alongside the success-path runner.test.ts. If a future scorer change makes a negative control unexpectedly pass, the suite caught a regression in itself — that's the point. Verified: 34/34 tests across 6 files; lint clean. Stacked on PR #570 (ba-104-layer3-and-ci); will rebase down the chain. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

kaiapeacock-eng force-pushed the ba-104-layer3-and-ci branch from c360d99 to 411d2d3 Compare May 8, 2026 16:16

kaiapeacock-eng requested a review from a team as a code owner May 8, 2026 16:16

kaiapeacock-eng force-pushed the ba-104-layer3-and-ci branch from 411d2d3 to d06d185 Compare May 8, 2026 16:26

kaiapeacock-eng force-pushed the ba-104-negative-controls branch from d57bf98 to 1961726 Compare May 8, 2026 16:27

kelsonpw approved these changes May 8, 2026

View reviewed changes

kaiapeacock-eng force-pushed the ba-104-layer3-and-ci branch from d06d185 to 624a524 Compare May 8, 2026 17:36

kaiapeacock-eng force-pushed the ba-104-negative-controls branch from 1961726 to 1c178b2 Compare May 8, 2026 17:36

cursor Bot reviewed May 8, 2026

View reviewed changes

Comment thread .github/workflows/evals-pr-scenarios.yml

kaiapeacock-eng force-pushed the ba-104-layer3-and-ci branch from 624a524 to 6ee0853 Compare May 8, 2026 18:33

kaiapeacock-eng force-pushed the ba-104-negative-controls branch from 1c178b2 to 4aa30e6 Compare May 8, 2026 18:34

kaiapeacock-eng merged commit 1915919 into ba-104-layer3-and-ci May 8, 2026
8 checks passed

kaiapeacock-eng mentioned this pull request May 8, 2026

feat(evals): finish Phase 1 — Ring 1 complete, Layers 0-6 wired, --no-telemetry #670

Open

12 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

test(evals): negative-control regressions catch the cases they should#577

test(evals): negative-control regressions catch the cases they should#577
kaiapeacock-eng merged 1 commit into
ba-104-layer3-and-cifrom
ba-104-negative-controls

kaiapeacock-eng commented May 6, 2026 •

edited by cursor Bot

Loading

Uh oh!

kelsonpw left a comment

Uh oh!

cursor Bot left a comment •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

kaiapeacock-eng commented May 6, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

kelsonpw left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

kaiapeacock-eng commented May 6, 2026 •

edited by cursor Bot

Loading

cursor Bot left a comment •

edited

Loading