Skip to content

test(evals): negative-control regressions catch the cases they should#577

Merged
kaiapeacock-eng merged 1 commit into
ba-104-layer3-and-cifrom
ba-104-negative-controls
May 8, 2026
Merged

test(evals): negative-control regressions catch the cases they should#577
kaiapeacock-eng merged 1 commit into
ba-104-layer3-and-cifrom
ba-104-negative-controls

Conversation

@kaiapeacock-eng

@kaiapeacock-eng kaiapeacock-eng commented May 6, 2026

Copy link
Copy Markdown
Collaborator

Summary

Phase 5 of Week 2 — proof that the eval suite actually catches the regressions it was built to catch. Each test forks the existing nextjs-app-router/vanilla golden into a tmpdir, applies a deliberate regression, and asserts the full scorer stack reports the failure shape we expect.

Three controls:

  1. Wrong SDK family — substitute @amplitude/analytics-browser for @amplitude/unified in the golden's package.json. L0-correct-sdk-package hard-fails with a detail naming the wrong family.
  2. Hardcoded API key — plant the eval-only API key literal in the entry file. L0-no-hardcoded-key hard-fails.
  3. Confirmed event without a track() call — strip a track('Sign In') while keeping it in the proposed plan. L1-confirmed-events-tracked fails with a detail naming 'Sign In'.

If a future scorer change makes a negative control unexpectedly pass, the suite caught a regression in itself — that's the point.

Test plan

  • pnpm exec vitest run evals/ → 34/34 across 6 files
  • pnpm lint clean
  • All three negative controls fire as expected

Stacked: base is ba-104-layer3-and-ci (#570). Full chain: this → #570#567#560 → main.

🤖 Generated with Claude Code


Note

Low Risk
Low risk: adds new Vitest coverage only, exercising the existing replay/score pipeline without changing production scoring logic.

Overview
Adds a new Vitest negative-control suite that clones the nextjs-app-router/vanilla golden scenario into a temp directory, injects intentional regressions, and runs the full runReplay + score stack to assert the expected failure modes.

The tests verify three canary cases: wrong SDK family triggers L0-correct-sdk-package hard-fail, hardcoded API key triggers L0-no-hardcoded-key hard-fail, and missing track() for a confirmed event causes L1-confirmed-events-tracked to fail with the missing event called out.

Reviewed by Cursor Bugbot for commit 4aa30e6. Bugbot is set up for automated code reviews on this repo. Configure here.

@kaiapeacock-eng kaiapeacock-eng force-pushed the ba-104-layer3-and-ci branch from c360d99 to 411d2d3 Compare May 8, 2026 16:16
@kaiapeacock-eng kaiapeacock-eng requested a review from a team as a code owner May 8, 2026 16:16
@kaiapeacock-eng kaiapeacock-eng force-pushed the ba-104-layer3-and-ci branch from 411d2d3 to d06d185 Compare May 8, 2026 16:26
@kaiapeacock-eng kaiapeacock-eng force-pushed the ba-104-negative-controls branch from d57bf98 to 1961726 Compare May 8, 2026 16:27

@kelsonpw kelsonpw left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Three sharp negative-control tests (wrong SDK family, hardcoded API key, missing track() for a confirmed event) all asserting on the failure shape — name in the detail, hard-fail on the L0 ones, soft-fail with event name on the L1. Self-validating in the right way: if a future scorer change makes one of these unexpectedly pass, the suite caught a regression in itself.\n\nNote: Node 22/24 unit-test failure (DataIngestionCheckScreen tracking-plan polling test) is a known flake unrelated to this PR — same test failed on 570 and 575 in the same window. Eval scorer + Ring 1 gate are green, which is what matters here. Stacked on #570; will merge cleanly once the chain lands.

@kaiapeacock-eng kaiapeacock-eng force-pushed the ba-104-layer3-and-ci branch from d06d185 to 624a524 Compare May 8, 2026 17:36
@kaiapeacock-eng kaiapeacock-eng force-pushed the ba-104-negative-controls branch from 1961726 to 1c178b2 Compare May 8, 2026 17:36

@cursor cursor Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Workflow paths filter doesn't include itself for self-validation
    • Added '.github/workflows/evals-pr-scenarios.yml' to its own paths trigger so changes to the workflow file will trigger CI validation.

Create PR

Or push these changes by commenting:

@cursor push 002ac15775
Preview (002ac15775)
diff --git a/.github/workflows/evals-pr-scenarios.yml b/.github/workflows/evals-pr-scenarios.yml
--- a/.github/workflows/evals-pr-scenarios.yml
+++ b/.github/workflows/evals-pr-scenarios.yml
@@ -21,6 +21,7 @@
       - 'evals/**'
       - 'package.json'
       - '.github/workflows/evals-pr.yml'
+      - '.github/workflows/evals-pr-scenarios.yml'
 
 # Run-once-per-ref semantics: a new push to the PR cancels the prior
 # run. PR-gate evals are deterministic against golden replays today,

You can send follow-ups to the cloud agent here.

Reviewed by Cursor Bugbot for commit 1c178b2. Configure here.

Comment thread .github/workflows/evals-pr-scenarios.yml
@kaiapeacock-eng kaiapeacock-eng force-pushed the ba-104-layer3-and-ci branch from 624a524 to 6ee0853 Compare May 8, 2026 18:33
Phase 5 of Week 2 — proof that the eval suite actually catches the
regressions it was built to catch. Each test forks the existing
nextjs-app-router/vanilla golden into a tmpdir, applies a deliberate
regression, and asserts the full scorer stack reports the failure
shape we expect.

Three controls land here:

  1. Wrong SDK family — substitute @amplitude/analytics-browser for
     @amplitude/unified in the golden's package.json. L0-correct-sdk-package
     hard-fails with a detail naming the wrong family.
  2. Hardcoded API key — plant the eval-only API key literal in the
     entry file. L0-no-hardcoded-key hard-fails.
  3. Confirmed event without a track() call — strip a track('Sign In')
     while keeping it in the proposed plan. L1-confirmed-events-tracked
     fails with a detail naming 'Sign In'.

These run cheaply (sub-second per test) so they live alongside the
success-path runner.test.ts. If a future scorer change makes a
negative control unexpectedly pass, the suite caught a regression in
itself — that's the point.

Verified: 34/34 tests across 6 files; lint clean.

Stacked on PR #570 (ba-104-layer3-and-ci); will rebase down the
chain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kaiapeacock-eng kaiapeacock-eng force-pushed the ba-104-negative-controls branch from 1c178b2 to 4aa30e6 Compare May 8, 2026 18:34
@kaiapeacock-eng kaiapeacock-eng merged commit 1915919 into ba-104-layer3-and-ci May 8, 2026
8 checks passed
kaiapeacock-eng added a commit that referenced this pull request May 8, 2026
* feat(evals): Layer 3 build runner + PR-gate CI workflow

Phase 4 of Week 2 — Layer 3 grades whether the integration still
compiles after the wizard runs, and a workflow gates PRs on Layers
0-3 with a Ring 1 matrix.

  * runner/build.ts — runBuild() spawns `pnpm install` then the
    scenario.buildCommand inside the working tree, captures stderr,
    redacts via the same observability/redact helper used for stdout.
    Throws only on timeouts/spawn errors; non-zero exit codes pass
    through to the BuildResult so the scorer can grade.
  * scorers/layer3-build/build-passes.ts — heavy 10pts. Skip-passes
    with weight 0 when no buildResult is on the artifact (current
    state on the existing golden); passes with weight 10 when the
    build exits 0; fails with the redacted stderr tail otherwise.
    Distinguishes install failure (likely lockfile drift) from
    build failure (likely a wizard regression) in the detail.
  * runner/types.ts — new BuildResult type; Artifact gains optional
    buildResult field.
  * runner/invoke-wizard.ts — runReplay loads optional
    golden/build-result.json so fixtures can pin a recorded outcome.
    runLive wiring is deferred — live runs require the eval-only
    Amplitude project (open decision #2) and pnpm install costs are
    not worth incurring on every dev-checkout run today.
  * runner/score.ts — orchestrator's downstream loop already covered
    Layer 3 via the layer > 0 filter; no change to the loop, just
    the scorer registration.
  * .github/workflows/evals-pr.yml — pull_request workflow,
    paths-filtered to the wizard agent surface + skills + evals,
    runs vitest on evals/ and a Ring 1 scenario matrix in parallel.
    Concurrency cancels stale runs. Artifacts uploaded for triage.
    Actions pinned to commit SHAs per the repo's other workflows.

Verified:
  * pnpm exec vitest run evals/ — 31/31 across 5 files
  * pnpm evals:run nextjs-app-router/vanilla — 50/50, contractOk
  * pnpm lint clean

Stacked on PR #567 (ba-104-runner-infra); will rebase onto main
after #567 lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(evals): negative-control regressions catch the cases they should (#577)

Phase 5 of Week 2 — proof that the eval suite actually catches the
regressions it was built to catch. Each test forks the existing
nextjs-app-router/vanilla golden into a tmpdir, applies a deliberate
regression, and asserts the full scorer stack reports the failure
shape we expect.

Three controls land here:

  1. Wrong SDK family — substitute @amplitude/analytics-browser for
     @amplitude/unified in the golden's package.json. L0-correct-sdk-package
     hard-fails with a detail naming the wrong family.
  2. Hardcoded API key — plant the eval-only API key literal in the
     entry file. L0-no-hardcoded-key hard-fails.
  3. Confirmed event without a track() call — strip a track('Sign In')
     while keeping it in the proposed plan. L1-confirmed-events-tracked
     fails with a detail naming 'Sign In'.

These run cheaply (sub-second per test) so they live alongside the
success-path runner.test.ts. If a future scorer change makes a
negative control unexpectedly pass, the suite caught a regression in
itself — that's the point.

Verified: 34/34 tests across 6 files; lint clean.

Stacked on PR #570 (ba-104-layer3-and-ci); will rebase down the
chain.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants