Skip to content

feat(evals): Layer 2 static scorers (criteria 2, 3, 7, 11, 12, 15)#566

Merged
kaiapeacock-eng merged 2 commits into
mainfrom
ba-104-layer2-scorers
May 8, 2026
Merged

feat(evals): Layer 2 static scorers (criteria 2, 3, 7, 11, 12, 15)#566
kaiapeacock-eng merged 2 commits into
mainfrom
ba-104-layer2-scorers

Conversation

@kaiapeacock-eng

@kaiapeacock-eng kaiapeacock-eng commented May 6, 2026

Copy link
Copy Markdown
Collaborator

Summary

Six Layer 2 static scorers that catch DevRel-review regressions without needing a build. Stacked on PR #560 (ba-104-add-eval-suite); will rebase onto main after #560 lands.

  • L2-server-client-boundary (criterion 11, heavy 10) — highest signal. Catches a browser SDK imported into an App Router Server Component, server action, or app/api route.
  • L2-server-sdk-usage (criterion 12, heavy 10) — server-only Amplitude usage must use @amplitude/analytics-node.
  • L2-init-options-commented (criterion 7, medium 5) — init() options object must carry at least one comment.
  • L2-version-range (criterion 2, medium 5) — no wildcard / pre-release @amplitude/* versions.
  • L2-no-vendor-additions (criterion 3, medium 5) — diffs working package.json against pristine; fails on non-Amplitude deps the agent added.
  • L2-property-key-naming (criterion 15, soft warn) — flags non-lowercase with spaces property keys.

AST work uses the TypeScript compiler API (already a project dep). score.ts now loops over every layer > 0 generically so future layers slot in without orchestrator edits.

Test plan

  • pnpm exec vitest run evals/ → 29/29 across 4 test files
  • pnpm evals:run nextjs-app-router/vanilla → 85/85 (was 50/50; +35 from new Layer 2 weights)
  • pnpm lint clean
  • Reviewer: spot-check server-client-boundary.ts heuristic — App Router server-only classification is the load-bearing part
  • Reviewer: confirm no-vendor-additions.ts findPristine walk handles both live and golden working-dir layouts

🤖 Generated with Claude Code


Note

Medium Risk
Adds new AST-based scoring rules and changes the scorer runner to execute all layers >0, which can change evaluation outcomes and gating behavior across scenarios. Risk is moderate due to new filesystem/AST parsing heuristics (esp. Next.js server/client classification) affecting pass/fail results.

Overview
Adds Layer 2 “static” scorers to the eval suite (criteria 2, 3, 7, 11, 12, 15), using the TypeScript compiler API to scan touched files for SDK regressions (server/client boundary violations, server SDK usage in server-only routes, init-options requiring comments, Amplitude dependency version-range sanity, and blocking non-vendor dependency additions; plus a soft-warn on event property key naming).

Updates the scoring runner to include these new scorers and to run all downstream layers (layer > 0) generically after Layer 0, so future layers participate without further orchestrator changes, and extends tests with targeted Layer 2 failure-mode cases plus an adjusted integration assertion to account for the added scoring weight.

Reviewed by Cursor Bugbot for commit 292e73f. Bugbot is set up for automated code reviews on this repo. Configure here.

@kaiapeacock-eng kaiapeacock-eng requested a review from a team as a code owner May 6, 2026 19:35

@cursor cursor Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Autofix Details

Bugbot Autofix prepared fixes for both issues found in the latest run.

  • ✅ Fixed: Regex doesn't allow hyphens despite comment claiming otherwise
    • Added hyphen to the character class in the ALLOWED_KEY regex to match the documented intent that legacy keys with hyphens are acceptable.
  • ✅ Fixed: Server SDK scorer false-positives on non-Amplitude track calls
    • Changed the skip condition to only check for the absence of Amplitude imports, preventing false positives on files with non-Amplitude track() functions.

Create PR

Or push these changes by commenting:

@cursor push c60bf1c2a1
Preview (c60bf1c2a1)
diff --git a/evals/scorers/layer2-static/property-key-naming.ts b/evals/scorers/layer2-static/property-key-naming.ts
--- a/evals/scorers/layer2-static/property-key-naming.ts
+++ b/evals/scorers/layer2-static/property-key-naming.ts
@@ -25,7 +25,7 @@
 
 // Single-word lowercase or lowercase-with-spaces, no underscores or
 // camelCase. Allows hyphens because some legacy keys use them.
-const ALLOWED_KEY = /^[a-z][a-z0-9 ]*[a-z0-9]$|^[a-z]$/;
+const ALLOWED_KEY = /^[a-z][a-z0-9 -]*[a-z0-9]$|^[a-z]$/;
 const ALLOWED_PREFIXES = ['$']; // Amplitude reserved (e.g. $app_name)
 
 function isOk(key: string): boolean {

diff --git a/evals/scorers/layer2-static/server-sdk-usage.ts b/evals/scorers/layer2-static/server-sdk-usage.ts
--- a/evals/scorers/layer2-static/server-sdk-usage.ts
+++ b/evals/scorers/layer2-static/server-sdk-usage.ts
@@ -117,7 +117,7 @@
           i.specifier.startsWith(`${SERVER_SDK}/`),
       );
 
-      if (!amplitudeImport && !hasTrackCall && !hasInitCall) continue;
+      if (!amplitudeImport) continue;
 
       // Server-only file with Amplitude usage — must be the server
       // SDK family. (Criterion 11 catches the wrong-import case

You can send follow-ups to the cloud agent here.

Comment thread evals/scorers/layer2-static/property-key-naming.ts
Comment thread evals/scorers/layer2-static/server-sdk-usage.ts

@kelsonpw kelsonpw left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overview

Layer 2 static scorers covering criteria 2, 3, 7, 11, 12, 15. Six AST-based checks built on the TypeScript compiler API (no new heavyweight dep), wired generically into score.ts so future layers slot in without orchestrator edits. Negative-path coverage in evals/scorers/__tests__/layer2-static.test.ts plus the existing golden replay still passes (now 85/85 with Layer 2 weighted in).

As the description notes, the diff stacks on PR #560 — this review focuses on the code introduced by the Layer 2 commit (9b1723d), not the runner/scorers from #560.

Overall the work is well-thought-out: helpful comments explaining heuristic choices, consistent failure-message shape, and a sane separation between hard-fail (criteria 11/12) and warn-only (criterion 15). A handful of small bugs and doc/code drifts are worth fixing before merge.

Findings

Medium

  1. property-key-naming.ts:28 — comment claims hyphens are allowed; regex rejects them. The comment block above ALLOWED_KEY says "Allows hyphens because some legacy keys use them," but the regex /^[a-z][a-z0-9 ]*[a-z0-9]$|^[a-z]$/ only permits lowercase letters, digits, and spaces. A property like 'utm-source' would warn even though the comment promises it wouldn't. Either add - to the character class or fix the comment.

  2. no-vendor-additions.ts:97-103 — code doesn't enforce "same major" guarantee from the docstring. The header comment says "Pre-existing deps unchanged or version-bumped within the same major are fine," but the implementation has if (versionChanged && !isAdded) continue; — that allows any version change, including a 18→19 React major bump. If you genuinely want major-bump enforcement, parse the leading major out of both specs and compare; if you don't, drop "within the same major" from the comment so the next reader doesn't think the rule is tighter than it is.

  3. property-key-naming.ts:61 and server-sdk-usage.ts:112findCallsByName(sf, 'track') matches any function call named track, regardless of whether it's Amplitude's. A user codebase that already has track('lead_qualified', { camelCase: 1 }) from an unrelated analytics library would (a) generate spurious property-key warnings, and (b) trip server-sdk-usage into a hard fail if the file is server-only and doesn't import the node SDK. The fix is to gate the findCallsByName call on "this file imports an Amplitude package" — you already collect imports in both scorers; just check importsAnyAmplitude(...) first and skip the file otherwise. Without that gate, criterion 12 has a real false-positive surface as soon as a Ring 2/3 scenario uses a non-pristine baseline.

  4. property-key-naming.ts — only flags track(name, { … }) direct-call form; misses amplitude.track(...). findCallsByName requires ts.isIdentifier(node.expression), so any member-expression call (the typical SDK shape on instance/namespace imports) passes through unchecked. For the App Router scenario where folks import * as amplitude from '@amplitude/unified' or use client.track(...), criterion 15 would never fire. Acceptable for a soft-warn now, but worth a TODO/comment so the gap is documented rather than implicit.

Low / nits

  1. server-client-boundary.ts:75-77 and server-sdk-usage.ts:54-58 — duplicated expectedInitFile matching logic. Both scorers do path !== scenario.expectedInitFile && !path.endsWith('/' + scenario.expectedInitFile). With init-options-commented.ts:62-63 and import-present.ts:46-49 doing the same matching shape, there are now four implementations. Worth extracting matchInitFile(touched, scenario) into _ast-helpers.ts or runner/types.ts so a future change (e.g. case-insensitive on Windows fixtures) stays in one place. Not blocking.

  2. server-sdk-usage.ts:120 — gating logic. if (!amplitudeImport && !hasTrackCall && !hasInitCall) continue; lets a server-only file with only init() (no track, no import) reach the failure return. That's fine on the happy path but combined with finding (3), in a real customer codebase this could spuriously fire when there's a function literally named init for unrelated reasons. Same fix as (3): require amplitudeImport to be truthy, drop the hasTrackCall || hasInitCall arm.

  3. runner.test.ts:59-63 — comment drift. Comment says "the four Layer 1 weighted scorers" then enumerates seven of them, totalling 50. With Layer 2 in the stack the maxScore is now 85 per the PR description, but the assertion is still toBeGreaterThanOrEqual(45). The floor is fine, but tighten the comment (s/four/seven/) and consider asserting report.maxScore ≥ 80 to catch a regression where a Layer 2 scorer silently fails to register.

  4. _ast-helpers.ts:45isTsx detection only checks .tsx/.jsx. A .mts file containing JSX would be parsed as TS and fail to lex <Foo /> correctly. Probably fine for App Router (always .tsx) but worth a comment if intentional, or extend the check to .mtsx/.ctsx if you want it tighter.

  5. version-range.ts:38 — wildcard regex. /^(?:\*|x|x\.x|x\.x\.x|\d+\.x|\d+\.x\.x|latest)$/i is fine for the common cases, but 1.2.x (very common npm shape) would slip through because the regex requires either \d+\.x (matches 1.x) or \d+\.x\.x (matches 1.x.x) — 1.2.x matches neither. Add \d+\.\d+\.x to the alternation.

  6. init-options-commented.ts:91-109 — early-return on the first init() call. If the file has multiple init() calls (which Layer 0's single-init-call already hard-fails on, but this scorer runs independently if Layer 0 isn't checking the same file), only the first one is examined. Mostly defensive — the loop with for (const call of calls) followed by return inside the body means subsequent calls never get inspected. If the first init has zero options the function returns the empty-object pass, even if a second init has 12 uncommented options. Minor.

Tests

  • 14 negative-path tests across the 6 new scorers — good coverage of the unhappy paths the scorers are designed to catch. Specifically nice to see the 'use client' exemption test for criterion 11 and the workspace-protocol-allowed branch missing from the version-range test (worth adding).
  • The success-path test (runner.test.ts) is unchanged but the PR description claims it's now 85/85 — the assertion still uses >= 45 which won't catch a regression where a Layer 2 scorer fails to load. Consider tightening per finding (7).
  • Property-based / fuzz coverage isn't required at this stage but the regex in version-range.ts would benefit from a parametrized table test (['*', 'x', '1.x', '1.x.x', 'latest', '1.2.3-alpha', '1.2.3-rc.1', '0.0.0-canary-abc']).
  • Tests directly mutate process.env.EVALS_WORKING_DIR and clean up in afterEach — fine for vitest's default sequential-within-file model but worth noting if anyone later switches the eval suite to --threads=true.

Security

No new credential / file-write / redaction surface in the Layer 2 commit (those landed in #560). The scorers read files via readFileSync(join(root, path)) where root is process.env.EVALS_WORKING_DIR and path comes from the artifact's diff. There's no .. sanitization on path, but the diff originates from the runner's own fs-snapshot (not user input), so traversal isn't a realistic threat here. Acceptable.

Verdict

Approve with nits. Findings 1–4 are real but small in blast radius (warn-only or scoped to App Router scenarios that don't yet exist beyond the vanilla golden). Finding 3 (track-name false positive) is the one I'd most want addressed before this scorer stack runs against a real customer-shaped codebase, because it can hard-fail a clean integration. The rest are doc/code drift and clean-up.

No blocking concerns on the architecture: the generic layer iteration in score.ts, the AST helpers, and the failure-mode test pattern are all in good shape and set up well for Layer 3+.

@cursor cursor Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Proposed-before-confirmed ordering not enforced in event collection
    • Refactored the function to find the last confirmation first, then search only up to that index for the last proposal, ensuring temporal ordering is enforced.

Create PR

Or push these changes by commenting:

@cursor push d1382d30ed
Preview (d1382d30ed)
diff --git a/evals/scorers/layer1-structural/confirmed-events-tracked.ts b/evals/scorers/layer1-structural/confirmed-events-tracked.ts
--- a/evals/scorers/layer1-structural/confirmed-events-tracked.ts
+++ b/evals/scorers/layer1-structural/confirmed-events-tracked.ts
@@ -35,15 +35,22 @@
   // `event_plan_confirmed`. If there's a confirmed event at all, the
   // proposal it confirmed is the canonical list.
   const out = new Set<string>();
-  let lastProposedIndex = -1;
   let confirmedIndex = -1;
+  // First find the last confirmation
   for (let i = 0; i < artifact.runLog.length; i++) {
     const data = artifact.runLog[i].data as
       | { event?: string; events?: Array<{ name?: string }> }
       | undefined;
-    if (data?.event === 'event_plan_proposed') lastProposedIndex = i;
     if (data?.event === 'event_plan_confirmed') confirmedIndex = i;
   }
+  // Then find the last proposal before that confirmation
+  let lastProposedIndex = -1;
+  for (let i = 0; i < confirmedIndex; i++) {
+    const data = artifact.runLog[i].data as
+      | { event?: string; events?: Array<{ name?: string }> }
+      | undefined;
+    if (data?.event === 'event_plan_proposed') lastProposedIndex = i;
+  }
   if (confirmedIndex >= 0 && lastProposedIndex >= 0) {
     const proposed = artifact.runLog[lastProposedIndex].data as {
       events?: Array<{ name?: string }>;

You can send follow-ups to the cloud agent here.

Comment thread evals/scorers/layer1-structural/confirmed-events-tracked.ts
kaiapeacock-eng and others added 2 commits May 8, 2026 09:20
Layer 2 covers the "DevRel review" rules from docs/evals.md that
need AST inspection but not a build: server/client boundary,
server-SDK usage, init-options commenting, version range, surprise
deps, and property-key naming. TypeScript compiler API (already a
project dep) drives the AST work — no new heavyweight dep.

Scorers added:

  - L2-server-client-boundary (criterion 11, heavy 10) — the
    highest-signal Layer 2 check. Hard fails when an App Router
    Server Component, server action, or app/api route imports
    @amplitude/unified or @amplitude/analytics-browser.
  - L2-server-sdk-usage (criterion 12, heavy 10) — when server-only
    Amplitude usage exists, asserts it uses @amplitude/analytics-node.
  - L2-init-options-commented (criterion 7, medium 5) — the init()
    options object should carry at least one comment so togglable
    knobs aren't a black box.
  - L2-version-range (criterion 2, medium 5) — fails on wildcard
    majors and pre-release tags in @amplitude/* version specifiers.
  - L2-no-vendor-additions (criterion 3, medium 5) — diffs working
    package.json against pristine; fails on any non-Amplitude dep
    the agent added unprompted.
  - L2-property-key-naming (criterion 15, soft warn) — flags property
    keys that don't follow the lowercase-with-spaces convention.
    Soft warn — never gates merge.

score.ts now loops over every layer > 0 generically so future layers
slot in without orchestrator edits.

Tests:
  - 14 failure-mode tests across the 6 scorers
  - Existing runner.test.ts still covers the success path against
    the golden (now scoring 85/85 with Layer 2 in the stack)

Stacked on PR #560 (ba-104-add-eval-suite); will rebase onto main
after #560 lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…fixes

Address kelsonpw and Bugbot review feedback on Layer 2 scorers:

Medium — false-positive surface on track() / init() (criteria 12, 15).
The scorers used findCallsByName(sf, 'track') without checking whether
the file actually imported Amplitude, so a customer codebase with
pre-existing Segment / Mixpanel calls would either trip a hard-fail on
criterion 12 (server-only file with non-Amplitude track() lacking the
node SDK import) or surface spurious property-key warnings on criterion
15. Add an importsAnyAmplitude gate to both scorers so files without an
Amplitude import are skipped. Tests cover the gate explicitly.

Low — comment / regex / docstring drift.
- property-key-naming: comment claimed hyphens were allowed; the regex
  rejects them. Fixed the comment to match the project rule (CLAUDE.md
  Analytics conventions).
- no-vendor-additions: docstring said "version-bumped within the same
  major are fine" but the implementation accepts any version change.
  Drop the false promise — enforcing major-bump semver is non-trivial
  for marginal value (the wizard isn't supposed to touch other deps).
- version-range: WILDCARD_MAJOR regex missed 1.2.x (very common npm
  shape). Added \d+\.\d+\.x to the alternation + a regression test.
- runner.test.ts: comment said "four Layer 1 weighted scorers" but
  enumerated seven. Fixed the count and noted Layer 2 adds 35 more.

Member-expression form (amplitude.track / client.track) still slips
through findCallsByName — documented as a known limitation in the
property-key-naming gate comment so future readers don't expect
member-call coverage from a soft-warn scorer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kaiapeacock-eng kaiapeacock-eng force-pushed the ba-104-layer2-scorers branch from 9663a22 to 292e73f Compare May 8, 2026 16:21

@cursor cursor Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

Bugbot Autofix prepared fixes for both issues found in the latest run.

  • ✅ Fixed: getDirective continues past non-directive strings instead of stopping
    • Changed continue to break on line 136 so the function stops at the first non-directive string literal, matching Next.js/React spec requirements.
  • ✅ Fixed: Exported isAtModuleScope function is never imported anywhere
    • Removed the unused isAtModuleScope function entirely as it was dead code with no current consumers.

Create PR

Or push these changes by commenting:

@cursor push cf4e185e34
Preview (cf4e185e34)
diff --git a/evals/scorers/layer2-static/_ast-helpers.ts b/evals/scorers/layer2-static/_ast-helpers.ts
--- a/evals/scorers/layer2-static/_ast-helpers.ts
+++ b/evals/scorers/layer2-static/_ast-helpers.ts
@@ -133,7 +133,7 @@
       if (v === 'use client' || v === 'use server') return v;
       // Stop at the first non-directive statement — directives must
       // appear in the prologue.
-      continue;
+      break;
     }
     break;
   }
@@ -163,33 +163,3 @@
   walk(source);
   return out;
 }
-
-/**
- * Return true if the call expression sits at the top-level module
- * scope (not inside a function, class, block, etc.) — useful for
- * "init() must not run at module scope of a Server Component."
- */
-export function isAtModuleScope(
-  call: ts.CallExpression,
-  source: ts.SourceFile,
-): boolean {
-  let parent: ts.Node | undefined = call.parent;
-  while (parent && parent !== source) {
-    if (
-      ts.isFunctionDeclaration(parent) ||
-      ts.isFunctionExpression(parent) ||
-      ts.isArrowFunction(parent) ||
-      ts.isMethodDeclaration(parent) ||
-      ts.isClassDeclaration(parent) ||
-      ts.isBlock(parent) ||
-      ts.isIfStatement(parent) ||
-      ts.isForStatement(parent) ||
-      ts.isWhileStatement(parent) ||
-      ts.isTryStatement(parent)
-    ) {
-      return false;
-    }
-    parent = parent.parent;
-  }
-  return true;
-}

You can send follow-ups to the cloud agent here.

Reviewed by Cursor Bugbot for commit 292e73f. Configure here.

if (v === 'use client' || v === 'use server') return v;
// Stop at the first non-directive statement — directives must
// appear in the prologue.
continue;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getDirective continues past non-directive strings instead of stopping

Low Severity

The getDirective function uses continue when it encounters a non-directive string literal (e.g. 'use strict'), which causes it to keep searching subsequent statements. Per the Next.js/React spec, 'use client' must be the very first statement — if any other expression precedes it, the directive is invalid and the file is treated as a server component. The comment even says "Stop at the first non-directive statement" but the code does the opposite. This could cause false negatives in server-client-boundary where a file with 'use strict'; 'use client'; is incorrectly classified as a client component.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 292e73f. Configure here.

parent = parent.parent;
}
return true;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exported isAtModuleScope function is never imported anywhere

Low Severity

isAtModuleScope is exported from _ast-helpers.ts but never imported by any file in the codebase. None of the six Layer 2 scorers in this PR use it. This is dead code that adds maintenance surface without current value.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 292e73f. Configure here.

@kelsonpw kelsonpw left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Six L2 scorers cover the high-signal DevRel review checks without needing a build, and the importsAnyAmplitude gate added in 292e73f is the right fix — without it, server-sdk-usage and property-key-naming would false-positive on customer codebases that already have Segment/Mixpanel calls (criterion 12 hard-fail) or surface spurious property-key warnings (criterion 15). Tests cover the gate explicitly.\n\nGenericizing score.ts to loop over every layer > 0 means future layers slot in without orchestrator edits — paid off immediately for L3 in #570.\n\nNon-blocking from Bugbot:\n- evals/scorers/layer2-static/_ast-helpers.ts:171-195 — exported isAtModuleScope is unused; either consume it or drop it before this becomes maintenance debt.\n- evals/scorers/layer2-static/_ast-helpers.ts:136 — getDirective's continue-past-non-directive-strings means a leading 'use strict' would let it find a 'use client' later, but per the Next.js spec the directive must be the very first statement. Worth a follow-up to break instead of continue.\n- evals/scorers/layer2-static/property-key-naming.ts:34 — comment claims hyphens are allowed but the regex rejects them; fix one or the other.\n\nNone block the PR.

@kaiapeacock-eng kaiapeacock-eng merged commit eb5afc0 into main May 8, 2026
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants