perf(telemetry): unblock per-tool quality analysis (GDPR annotations + diagnostic fields + InvocationGuard) by wenytang-ms · Pull Request #1650 · microsoft/vscode-java-debug

wenytang-ms · 2026-06-09T06:45:21Z

Why

A 30-day deep-dive against ddtelfiltered confirmed that none of the diagnostic fields added in #1644 are reaching the cluster:

Field	Hits / 14d (3,773 LMT events)
`outcome` (success/failure/timeout/cancelled)	0
`errorcategory` (timeout / noActiveSession / …)	0
`errortype`	0
`failurereason`	0
`errorcode ≠ "0"`	0

Root cause: ddtelfiltered strips any Properties key that is not declared inside a /* __GDPR__ ... */ annotation block. #1644 added the fields to the TS layer but never added the annotations, so the schema-filter dropped them at ingest.

This also leaves several known quality issues invisible:

32.8% silent loss — debug_java_application.invoke rows with no matching outcome event within 60s
Linux start rate 4.3% vs Windows 27.7% — but we can't tell which step is dropping out
classNameDetection collapses 4 distinct failure causes (sourceDirMissing / fileNotFound / parseError / noPackageDeclaration) into one boolean
Started P95 = 100s but SESSION_WAIT_TIMEOUT was 45s, so ~10% of legitimate slow starts were being recorded as timeouts

What

`src/lmToolTelemetry.ts` (commit 1 — telemetry contract)

GDPR annotation blocks above all 3 sanitizedSend call sites (recordToolInvocation, recordChatActivation, recordLaunchInternal) — fixes the root cause of fields being filtered.
5 new enums replacing stringly-typed fields: OperatingSystem, ClassNameDetectionStrategy, ClassNameDetectionFailure, SentinelOutcome.
ToolInvocationRecord extended with os / javaMajorVersion / projectSystem / retryCount / previousOutcome.
LaunchInternalEvent union extended with 4 new sentinel variants:
- debugSession.sentinel — emitted on every invoke so even infinite hangs leave a trail
- debugSession.silentReturn — invoke returned with no outcome (the 32.8% gap)
- debugSession.cancelled / debugSession.exception — closed-loop attribution
- classNameDetection.failed — replaces the collapsed boolean
- debugSessionStarted/Timeout now carry elapsedMs + thresholdMs for histograms
InvocationGuard (beginDebugSessionInvocation()) — markOutcomeRecorded / markException / close for closed-loop sentinel tracking.
SessionInvocationTracker (nextAttempt() / completeAttempt()) — per-window, per-tool in-memory retry counter. Stores only enum + integer, no user data.

`src/languageModelTool.ts` (commit 2 — wiring)

Timeout tuning (driven by 30d Started latency: P50=12s / P90=67s / P95=100s):
- SESSION_WAIT_TIMEOUT: 45000 → 120000
- SMART_POLLING_MAX_WAIT: 15000 → 90000
All 10 invoke handlers now thread os / javaMajorVersion / retryCount / previousOutcome into recordToolInvocation, and call completeAttempt after.
debug_java_application is additionally wrapped in beginDebugSessionInvocation() so the 32.8% silent-loss bucket gets closed-loop attribution.
findFullyQualifiedClassName() rewritten to return { className, failureReason, strategy }. Failure branch emits classNameDetection.failed with the structured reason.
Started / Timeout events now carry elapsedMs + thresholdMs.

Privacy

All new fields are either:

Enums with a fixed allowlist (os, failureReason, strategy, previousOutcome, …), or
Integers (retryCount, elapsedMs, thresholdMs, …), or
Already-classified categories from Harden Language Model Tool telemetry against PII leaks #1644.

No free-form strings, no user paths, no class names, no expression text. Each field has a classification + purpose + owner declared inside the matching __GDPR__ block.

getJavaMajorVersion() reads only the major number (e.g. "21", "8") from the redhat.java extension API or JAVA_VERSION env var — never the full vendor string.

Validation plan (after ship)

debugagent/lmt-quality-queries.md has the 10 queries that become runnable once data flows:

Outcome distribution by tool — should reach 24+ field coverage (vs 0 today)
Silent-loss bucket — should drop to <5% once debugSession.silentReturn lands
Per-OS funnel — verify whether the timeout bump closes the Linux 4.3% gap
classNameDetection cause breakdown — should split 66 noPackage events into 4 buckets
Retry attribution — confirm the 17.7% → 64.9% success-by-retry-count pattern
Started latency histogram — verify the 120s threshold moves the P95 mass

Test

npx tsc --noEmit — clean
npm run tslint — clean
Integration tests (npm test) require VS Code download; not run in this CI pass

Builds on feat(copilot): add chat skills, instructions and when-gates for langu… #1643 (chat skills + when-gates) and Harden Language Model Tool telemetry against PII leaks #1644 (PII hardening)
Companion analysis: debugagent/06-深度诊断-PR定向-2026-06-09.md, debugagent/05-PR1643-效果验证-2026-06-09.md

Co-authored-by: Copilot 223556219+Copilot@users.noreply.github.com

…ionGuard infra Telemetry contract changes to unblock per-tool quality analysis (driven by the 9-query deep-dive in debugagent/06-深度诊断-PR定向-2026-06-09.md): 1. GDPR annotation blocks (the root cause of PR #1644 fields being filtered) - Added /* __GDPR__ ... */ blocks above the three sanitizedSend call sites (recordToolInvocation, recordChatActivation, recordLaunchInternal) so the ddtelfiltered cluster keeps new Properties keys instead of stripping them. 2. Five new enums to replace stringly-typed fields: - OperatingSystem: win | mac | linux | other - ClassNameDetectionStrategy: mavenStandard | gradleStandard | vscodeSrc | workspaceRoot - ClassNameDetectionFailure: sourceDirMissing | fileNotFound | parseError | noPackageDeclaration - SentinelOutcome: recorded | silentReturn | cancelled | exception 3. ToolInvocationRecord extended with platform/version context: - os, javaMajorVersion, projectSystem, retryCount, previousOutcome These split the per-tool funnel by OS / Java major / project system so we can confirm the Linux 4.3 percent vs Windows 27.7 percent start-rate divergence. 4. LaunchInternalEvent union extended with four new sentinel variants and elapsedMs/thresholdMs: - debugSession.sentinel: emitted on EVERY invoke so even infinite hangs leave a trail - debugSession.silentReturn: invoke returned with no outcome event (the 32.8 percent gap) - debugSession.cancelled: user cancelled mid-flight - debugSession.exception: handler threw an exception, classified - classNameDetection.failed: replaces collapsed boolean with strategy + failureReason - debugSessionStarted/Timeout now carry elapsedMs + thresholdMs for histograms 5. InvocationGuard infrastructure (beginDebugSessionInvocation()): - Returns markOutcomeRecorded / markException / close methods - Emits sentinel at begin; close auto-emits silentReturn if no outcome recorded - Lets us measure the 32.8 percent silent-loss bucket with closed-loop attribution 6. SessionInvocationTracker (nextAttempt / completeAttempt): - Per-window, per-tool in-memory retry counter - Stores ONLY enum (previousOutcome) and integer (count) — no user data - Quantifies the LM auto-retry-pays-off pattern (17.7 percent at 1 -> 64.9 percent at 6-10) Compiles clean (tsc --noEmit) and passes tslint.

…meouts Consumes the new telemetry contracts from the previous commit and threads them through all 10 LMT invoke handlers. Also tunes the two session-wait thresholds to cover the Started P95 latency observed in 30d telemetry. Timeout tuning (driven by 30d Started latency: P50=12s / P90=67s / P95=100s): SESSION_WAIT_TIMEOUT: 45000 -> 120000 (eventBased path, waitForSession=true) SMART_POLLING_MAX_WAIT: 15000 -> 90000 (default polling path) Previous values clipped ~10 percent of legitimate slow starts as timeouts and biased the two strategies against each other. Aligning at 120s/90s also lets us cleanly compare cross-strategy success rates. Per-tool context fields (added to all 10 recordToolInvocation call sites): os / javaMajorVersion: from getOs() / getJavaMajorVersion() getJavaMajorVersion probes the redhat.java extension API (cached), falls back to JAVA_VERSION env var, then 'unknown'. classifyJavaMajorVersion handles legacy '1.8.0_xxx' -> '8' and modern '21.0.1' -> '21'. retryCount / previousOutcome: from nextAttempt() / completeAttempt() Per-window, per-tool counter. Used to test the LM auto-retry-pays-off pattern. debug_java_application handler: Wraps the entire invoke in a beginDebugSessionInvocation() guard so even infinite hangs leave a sentinel + silentReturn trail. This is the closed-loop half of the 32.8 percent silent-loss bucket detected by D2 in the deep-dive. classNameDetection rewrite: findFullyQualifiedClassName() now returns { className, failureReason, strategy } instead of string|null. The failure branch emits the new classNameDetection.failed event with structured strategy + failureReason so we can distinguish sourceDirMissing / fileNotFound / parseError / noPackageDeclaration — the previous boolean detected:false collapsed all four causes. Timeout / Started events: debugSessionStarted.eventBased + debugSessionTimeout.eventBased + debugSessionTimeout.smartPolling now carry elapsedMs + thresholdMs so we can histogram the Started/Timeout latency distribution and verify the threshold bump moves the right mass. Compiles clean (tsc --noEmit) and passes tslint.

Copilot

Pull request overview

This PR extends the Language Model Tool (LMT) telemetry contract and wiring to unblock per-tool quality analysis in downstream pipelines that enforce /* __GDPR__ */ allowlists. It also adds richer diagnostic context (OS/Java version/retry attribution) and introduces a closed-loop “InvocationGuard” concept intended to attribute silent debug_java_application outcomes.

Changes:

Add GDPR annotation blocks and extend LMT telemetry schemas with new diagnostic fields and enums in src/lmToolTelemetry.ts.
Wire new diagnostic fields + per-session retry attribution through all LMT invoke handlers; tune launch timeouts in src/languageModelTool.ts.
Add new structured class-name detection result/failure reasoning and additional launch internal events.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File	Description
src/lmToolTelemetry.ts	Adds GDPR annotations, new enums/diagnostic fields, launch internal event variants, and InvocationGuard/retry-tracker utilities.
src/languageModelTool.ts	Threads new telemetry fields into tool invocations, tunes timeouts, and refactors class-name detection to return structured results.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Address review feedback on PR #1650: 1. InvocationGuard was effectively disabled — guard.markOutcomeRecorded() was called unconditionally in finally, so guard.close() could never emit debugSession.silentReturn / cancelled / exception. Moved the mark calls inline next to the four session-terminal recordLaunchInternal sites (debugSessionStarted.eventBased, debugSessionTimeout.eventBased, debugSessionDetected, debugSessionTimeout.smartPolling). Threaded guard parameter into debugJavaApplication. The catch path now lets close() emit debugSession.exception via markException as designed. 2. targetInfo formatting at the call site stringified the new structured ClassNameDetectionResult as [object Object]. Renamed local to 'detection' and explicitly read detection.className. The if-truthy branch was also always taken, suppressing the no-package warning. 3. The 'found file but no package' branch returned { className: simpleClassName } which made callers treat it as a successful detection, hiding the failure event. Now returns { className: null, failureReason: 'noPackageDeclaration' } so classNameDetection.failed fires. Behavior is preserved because the call site already falls back to input.target on null. 4. ClassNameDetectionResult is now a discriminated union (success has no failureReason; failure requires it). Eliminates the 'unused on success' sentinel and lets TS narrow at use sites. Required strict-null narrowing (className !== null) at the two call sites. tsc clean; tslint clean.

Copilot CLI added 2 commits June 9, 2026 14:29

wenytang-ms requested review from chagong, jdneo and testforstephen as code owners June 9, 2026 06:45

wenytang-ms changed the title ~~feat(telemetry): unblock per-tool quality analysis (GDPR annotations + diagnostic fields + InvocationGuard)~~ perf(telemetry): unblock per-tool quality analysis (GDPR annotations + diagnostic fields + InvocationGuard) Jun 9, 2026

This was referenced Jun 9, 2026

fix(lmtool): stop double-counting classNameDetection on every invoke #1651

Open

fix(lmtool): reframe timeout messages as transient + retry-friendly to nudge model retries #1652

Open

wenytang-ms requested a review from Copilot June 9, 2026 07:58

Copilot started reviewing on behalf of wenytang-ms June 9, 2026 07:58 View session

Copilot AI reviewed Jun 9, 2026

View reviewed changes

Comment thread src/languageModelTool.ts

Comment thread src/languageModelTool.ts

Comment thread src/languageModelTool.ts Outdated

Comment thread src/languageModelTool.ts Outdated

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(telemetry): unblock per-tool quality analysis (GDPR annotations + diagnostic fields + InvocationGuard)#1650

perf(telemetry): unblock per-tool quality analysis (GDPR annotations + diagnostic fields + InvocationGuard)#1650
wenytang-ms wants to merge 3 commits into
mainfrom
feat/lmt-telemetry-diagnostic-fields

wenytang-ms commented Jun 9, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

wenytang-ms commented Jun 9, 2026

Why

What

src/lmToolTelemetry.ts (commit 1 — telemetry contract)

src/languageModelTool.ts (commit 2 — wiring)

Privacy

Validation plan (after ship)

Test

Related

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

`src/lmToolTelemetry.ts` (commit 1 — telemetry contract)

`src/languageModelTool.ts` (commit 2 — wiring)