test(evals): negative-control regressions catch the cases they should#577
Conversation
c360d99 to
411d2d3
Compare
411d2d3 to
d06d185
Compare
d57bf98 to
1961726
Compare
kelsonpw
left a comment
There was a problem hiding this comment.
LGTM. Three sharp negative-control tests (wrong SDK family, hardcoded API key, missing track() for a confirmed event) all asserting on the failure shape — name in the detail, hard-fail on the L0 ones, soft-fail with event name on the L1. Self-validating in the right way: if a future scorer change makes one of these unexpectedly pass, the suite caught a regression in itself.\n\nNote: Node 22/24 unit-test failure (DataIngestionCheckScreen tracking-plan polling test) is a known flake unrelated to this PR — same test failed on 570 and 575 in the same window. Eval scorer + Ring 1 gate are green, which is what matters here. Stacked on #570; will merge cleanly once the chain lands.
d06d185 to
624a524
Compare
1961726 to
1c178b2
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Workflow paths filter doesn't include itself for self-validation
- Added '.github/workflows/evals-pr-scenarios.yml' to its own paths trigger so changes to the workflow file will trigger CI validation.
Or push these changes by commenting:
@cursor push 002ac15775
Preview (002ac15775)
diff --git a/.github/workflows/evals-pr-scenarios.yml b/.github/workflows/evals-pr-scenarios.yml
--- a/.github/workflows/evals-pr-scenarios.yml
+++ b/.github/workflows/evals-pr-scenarios.yml
@@ -21,6 +21,7 @@
- 'evals/**'
- 'package.json'
- '.github/workflows/evals-pr.yml'
+ - '.github/workflows/evals-pr-scenarios.yml'
# Run-once-per-ref semantics: a new push to the PR cancels the prior
# run. PR-gate evals are deterministic against golden replays today,You can send follow-ups to the cloud agent here.
Reviewed by Cursor Bugbot for commit 1c178b2. Configure here.
624a524 to
6ee0853
Compare
Phase 5 of Week 2 — proof that the eval suite actually catches the
regressions it was built to catch. Each test forks the existing
nextjs-app-router/vanilla golden into a tmpdir, applies a deliberate
regression, and asserts the full scorer stack reports the failure
shape we expect.
Three controls land here:
1. Wrong SDK family — substitute @amplitude/analytics-browser for
@amplitude/unified in the golden's package.json. L0-correct-sdk-package
hard-fails with a detail naming the wrong family.
2. Hardcoded API key — plant the eval-only API key literal in the
entry file. L0-no-hardcoded-key hard-fails.
3. Confirmed event without a track() call — strip a track('Sign In')
while keeping it in the proposed plan. L1-confirmed-events-tracked
fails with a detail naming 'Sign In'.
These run cheaply (sub-second per test) so they live alongside the
success-path runner.test.ts. If a future scorer change makes a
negative control unexpectedly pass, the suite caught a regression in
itself — that's the point.
Verified: 34/34 tests across 6 files; lint clean.
Stacked on PR #570 (ba-104-layer3-and-ci); will rebase down the
chain.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1c178b2 to
4aa30e6
Compare
* feat(evals): Layer 3 build runner + PR-gate CI workflow
Phase 4 of Week 2 — Layer 3 grades whether the integration still
compiles after the wizard runs, and a workflow gates PRs on Layers
0-3 with a Ring 1 matrix.
* runner/build.ts — runBuild() spawns `pnpm install` then the
scenario.buildCommand inside the working tree, captures stderr,
redacts via the same observability/redact helper used for stdout.
Throws only on timeouts/spawn errors; non-zero exit codes pass
through to the BuildResult so the scorer can grade.
* scorers/layer3-build/build-passes.ts — heavy 10pts. Skip-passes
with weight 0 when no buildResult is on the artifact (current
state on the existing golden); passes with weight 10 when the
build exits 0; fails with the redacted stderr tail otherwise.
Distinguishes install failure (likely lockfile drift) from
build failure (likely a wizard regression) in the detail.
* runner/types.ts — new BuildResult type; Artifact gains optional
buildResult field.
* runner/invoke-wizard.ts — runReplay loads optional
golden/build-result.json so fixtures can pin a recorded outcome.
runLive wiring is deferred — live runs require the eval-only
Amplitude project (open decision #2) and pnpm install costs are
not worth incurring on every dev-checkout run today.
* runner/score.ts — orchestrator's downstream loop already covered
Layer 3 via the layer > 0 filter; no change to the loop, just
the scorer registration.
* .github/workflows/evals-pr.yml — pull_request workflow,
paths-filtered to the wizard agent surface + skills + evals,
runs vitest on evals/ and a Ring 1 scenario matrix in parallel.
Concurrency cancels stale runs. Artifacts uploaded for triage.
Actions pinned to commit SHAs per the repo's other workflows.
Verified:
* pnpm exec vitest run evals/ — 31/31 across 5 files
* pnpm evals:run nextjs-app-router/vanilla — 50/50, contractOk
* pnpm lint clean
Stacked on PR #567 (ba-104-runner-infra); will rebase onto main
after #567 lands.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(evals): negative-control regressions catch the cases they should (#577)
Phase 5 of Week 2 — proof that the eval suite actually catches the
regressions it was built to catch. Each test forks the existing
nextjs-app-router/vanilla golden into a tmpdir, applies a deliberate
regression, and asserts the full scorer stack reports the failure
shape we expect.
Three controls land here:
1. Wrong SDK family — substitute @amplitude/analytics-browser for
@amplitude/unified in the golden's package.json. L0-correct-sdk-package
hard-fails with a detail naming the wrong family.
2. Hardcoded API key — plant the eval-only API key literal in the
entry file. L0-no-hardcoded-key hard-fails.
3. Confirmed event without a track() call — strip a track('Sign In')
while keeping it in the proposed plan. L1-confirmed-events-tracked
fails with a detail naming 'Sign In'.
These run cheaply (sub-second per test) so they live alongside the
success-path runner.test.ts. If a future scorer change makes a
negative control unexpectedly pass, the suite caught a regression in
itself — that's the point.
Verified: 34/34 tests across 6 files; lint clean.
Stacked on PR #570 (ba-104-layer3-and-ci); will rebase down the
chain.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


Summary
Phase 5 of Week 2 — proof that the eval suite actually catches the regressions it was built to catch. Each test forks the existing nextjs-app-router/vanilla golden into a tmpdir, applies a deliberate regression, and asserts the full scorer stack reports the failure shape we expect.
Three controls:
@amplitude/analytics-browserfor@amplitude/unifiedin the golden'spackage.json.L0-correct-sdk-packagehard-fails with a detail naming the wrong family.L0-no-hardcoded-keyhard-fails.track()call — strip atrack('Sign In')while keeping it in the proposed plan.L1-confirmed-events-trackedfails with a detail naming'Sign In'.If a future scorer change makes a negative control unexpectedly pass, the suite caught a regression in itself — that's the point.
Test plan
pnpm exec vitest run evals/→ 34/34 across 6 filespnpm lintcleanStacked: base is
ba-104-layer3-and-ci(#570). Full chain: this → #570 → #567 → #560 → main.🤖 Generated with Claude Code
Note
Low Risk
Low risk: adds new Vitest coverage only, exercising the existing replay/score pipeline without changing production scoring logic.
Overview
Adds a new Vitest negative-control suite that clones the
nextjs-app-router/vanillagolden scenario into a temp directory, injects intentional regressions, and runs the fullrunReplay+scorestack to assert the expected failure modes.The tests verify three canary cases: wrong SDK family triggers
L0-correct-sdk-packagehard-fail, hardcoded API key triggersL0-no-hardcoded-keyhard-fail, and missingtrack()for a confirmed event causesL1-confirmed-events-trackedto fail with the missing event called out.Reviewed by Cursor Bugbot for commit 4aa30e6. Bugbot is set up for automated code reviews on this repo. Configure here.