Skip to content

Refactor gRPC runtime services and add Node.js runtime metrics#139

Merged
wu-sheng merged 7 commits into
apache:masterfrom
songzhendong:feature/nodejs-grpc-v1-runtime-metrics
Jun 28, 2026
Merged

Refactor gRPC runtime services and add Node.js runtime metrics#139
wu-sheng merged 7 commits into
apache:masterfrom
songzhendong:feature/nodejs-grpc-v1-runtime-metrics

Conversation

@songzhendong

Copy link
Copy Markdown
Contributor

Branch: feature/nodejs-grpc-v1-runtime-metrics (based on upstream master @ 1d3ba73)
Commit: 481b290 (481b290beadfc530e4b0adef05dbbbb211af0a66) — single commit, 26 files changed

CI status (fork validation, commit 481b290)

All checks passed on fork before upstream submission (2026-06-26, 34/34):

Workflow Result Run
Build (Node 20 / 22 / 24) https://github.com/songzhendong/skywalking-nodejs/actions/runs/28235029528
Test (Lint / TestLib / TestPlugins) https://github.com/songzhendong/skywalking-nodejs/actions/runs/28235029569
License https://github.com/songzhendong/skywalking-nodejs/actions/runs/28235029547

Key job (runtime meter E2E):

Fork PR checks: https://github.com/songzhendong/skywalking-nodejs/pull/2/checks

OAP dependency (not in this PR)

Runtime meters are reported via MeterReportService, but OAP consumption requires a separate upstream PR adding the nodejs-runtime analyzer (nodejs-runtime.yaml, 6 MAL rules). Agent-side reporting and plugin E2E meterItems are covered here.


Summary

This PR does two things:

  1. gRPC architecture v1 — Refactor Node remote reporting to match Java agent.core.*: single Channel, unified BootService lifecycle, Builder/Decorator chain; Trace / Heartbeat / Meter share one code path for cross-language alignment.
  2. Runtime metrics — Six instance_nodejs_* process meters via MeterReportService (1s collect / 1s report).

Not in this PR: release scripts, version bump, CONTRIBUTING / How-to-release, httpbin test infra, OAP analyzer.

Motivation

Before refactor, Node used a standalone GrpcProtocol with a different structure from Java agent.core. This PR converges remote reporting onto the same architecture as Java: one shared gRPC channel, unified lifecycle and extension points, lower cross-language maintenance cost, and a path consistent with Java for failover, TLS, Event/Log, etc.

Before refactor After (v1)
GrpcProtocol + two Channels (Trace / Heartbeat each own connection) One Channel via shared GRPCChannelManager (same as Java / Go)
protocol/grpc/clients/* naming diverges from Java core/remote/* class names and roles map 1:1 to Java
Meter needed a parallel reporting stack MeterSender shares BootService → GRPCChannelManager → Client with Trace/Management
Segment serialized via external SegmentObjectAdapter Segment.transform() — same pattern as Java TraceSegment.transform()
No unified prepare → boot → shutdown lifecycle ServiceManager orchestrates all BootServices

v2 out of scope: multi-address failover, full TLS CA/mTLS, Event/Log clients.

Architecture overview

Before (Node old) — dual Channel, no central Channel manager

index.ts
  └── GrpcProtocol
        ├── HeartbeatClient          → own ManagementServiceClient + Channel
        └── TraceReportClient        → own TraceSegmentReportServiceClient + Channel
              └── SegmentObjectAdapter
              └── AuthInterceptor() per RPC

After (Node v1) — single Channel, Java-aligned

index.ts
  └── ServiceManager
        ├── GRPCChannelManager       → single shared Channel (Builder + Decorator chain)
        ├── TraceSegmentServiceClient → stub after CONNECTED; Segment.transform()
        ├── ServiceManagementClient   → stub after CONNECTED; 20s heartbeat
        └── MeterSender (optional)    → stub after CONNECTED; 6× runtime meters

Java — single Channel + SPI registration

ServiceManager (ServiceLoader loads BootService)
  ├── GRPCChannelManager           → single shared Netty Channel; periodic check + multi-address rotation
  ├── TraceSegmentServiceClient
  ├── ServiceManagementClient
  ├── MeterSender                  → JVM runtime + custom meters (MeterService API)
  ├── EventReportServiceClient
  └── LogReportServiceClient
        └── TraceSegment.transform()
flowchart TB
  subgraph nodeBefore["Node before refactor"]
    GP[GrpcProtocol] --> HB[HeartbeatClient Ch1]
    GP --> TR[TraceReportClient Ch2]
  end

  subgraph nodeV1["Node v1"]
    SM[ServiceManager] --> CM[GRPCChannelManager x1]
    CM --> TC[TraceSegmentServiceClient]
    CM --> MC[ServiceManagementClient]
    CM --> MS[MeterSender]
  end

  subgraph javaAgent["Java agent"]
    JSM[ServiceManager SPI] --> JCM[GRPCChannelManager x1]
    JCM --> JTC[TraceSegmentServiceClient]
    JCM --> JMC[ServiceManagementClient]
    JCM --> JMS[MeterSender]
    JCM --> JEV[EventReportServiceClient]
    JCM --> JLG[LogReportServiceClient]
  end

  nodeBefore ~~~ nodeV1 ~~~ javaAgent
Loading

Reference baseline

  • This PR (Node v1): feature/nodejs-grpc-v1-runtime-metrics (src/agent/core/)
  • Before refactor (upstream master): 1d3ba73 (src/agent/protocol/)
  • Java agent: apm-agent-core (org.apache.skywalking.apm.agent.core.{boot,remote,meter})

Alignment with Java Agent

Category Aspect This PR (Node v1) Java agent Status
Architecture Package layout agent/core/{boot,remote,meter}/ agent.core.{boot,remote,meter}/
Bootstrap ServiceManager.INSTANCE.boot() ServiceManager.INSTANCE.boot()
Lifecycle BootService (prepare/boot/shutdown/priority) Same
Service registration Hard-coded 4 services ServiceLoader + SPI ⚠️ no SPI
Segment serialization Segment.transform() TraceSegment.transform()
flush / shutdown ServiceManager.flush/shutdown ServiceManager unified dispatch
gRPC Channel Channel count 1 shared GRPCChannelManager 1 shared
Channel build Builder chain: Standard → TLS → Decorators Same
Connectivity notify GRPCChannelListener CONNECTED/DISCONNECT Same
Stub creation After CONNECTED callback After CONNECTED callback
Message size limit 50 MB 50 MB
Multi-address failover First comma-separated address only List rotation + periodic reconnect ❌ v2
Error-driven reconnect reportError() stub (debug log) reportError → reconnect ❌ v2
Periodic Channel check None GRPC_CHANNEL_CHECK_INTERVAL ❌ v2
Force reconnect / DNS re-resolve None FORCE_RECONNECTION_PERIOD, etc. ❌ v2
TLS SW_AGENT_SECURE TLSChannelBuildercreateSsl() Full TLSChannelBuilder chain ⚠️ minimal TLS
Custom CA / mTLS Not implemented agent/ca/ + cert path config ❌ v2
TLS E2E CI None simple/ssl, simple/mtls ❌ v2
Remote clients Trace reporting TraceSegmentServiceClient TraceSegmentServiceClient
Heartbeat / instance ServiceManagementClient (20s) ServiceManagementClient
Auth / Agent-Version headers AuthenticationDecorator + AgentIDDecorator Same
Event / Log Not implemented EventReportServiceClient / LogReportServiceClient ❌ v2
Stream completion wait GRPCStreamServiceStatus ported, not wired Used for stream RPCs ❌ v2
Runtime meter Reporting component MeterSender (BootService + shared Channel) MeterSender (BootService + shared Channel)
Collect / report period Default 1s / 1s (configurable) JVM runtime default 1s / 1s
RPC service MeterReportService (generic Meter protobuf) JVMMetricReportService (JVM protobuf) ⚠️ different RPC / message schema (both gRPC)
Metric coverage instance_nodejs_* (CPU + memory) heap / GC / thread / classloader / … ⚠️ different coverage
Meter types MeterSingleValue gauge Counter / Gauge / Histogram ⚠️ no Counter/Histogram
Custom business meters No public API (README suggests OTel) MeterFactory / MeterService (≤500) ❌ v2
Buffer limit SW_AGENT_NODEJS_RUNTIME_METRICS_BUFFER_SIZE = 600 Config.Jvm.BUFFER_SIZE = 600
Per-metric timestamp One shared timestamp per stream batch Each JVMMetric has its own time ⚠️ precision difference
Retry on failure Drop oldest when full; discard batch on send failure offer full → drop oldest; send failure not re-queued
CPU calculation process.cpuUsage() ÷ logical cores JVM MXBean multi-dimensional ⚠️ different algorithm
OAP consumption nodejs-runtime.yaml (6 MAL rules) Built-in JVM analyzer ⚠️ needs OAP analyzer
Tests Runtime meter unit tests RuntimeMetricsCollector.test.ts
Plugin E2E meter assertions express meterItems (server + client, 6 meters each)
TLS E2E CI None (plaintext gRPC CI) Yes ❌ v2

Design note: Node runtime metrics use MeterReportService (same as Go/Python), not Java JVMMetricReportService.

Six meters: instance_nodejs_process_cpu, instance_nodejs_heap_used, instance_nodejs_heap_total, instance_nodejs_heap_limit, instance_nodejs_rss, instance_nodejs_external_memory

File / class mapping (before → v1 → Java)

Role Node before Node v1 Java
Bootstrap GrpcProtocol.ts ServiceManager.ts ServiceManager.java
Lifecycle contract Protocol.ts BootService.ts BootService.java
Channel manager ❌ (scattered in clients) GRPCChannelManager.ts GRPCChannelManager.java
Channel build GRPCChannel.ts GRPCChannel.java
Plaintext builder StandardChannelBuilder.ts StandardChannelBuilder.java
TLS builder ❌ (inline in client) TLSChannelBuilder.ts TLSChannelBuilder.java
Auth decorator AuthInterceptor.ts AuthenticationDecorator.ts AuthenticationDecorator.java
Agent-Version decorator AgentIDDecorator.ts AgentIDDecorator.java
Connection status Client.ts GRPCChannelStatus.ts + Listener GRPCChannelStatus.java + Listener
Trace client TraceReportClient.ts TraceSegmentServiceClient.ts TraceSegmentServiceClient.java
Heartbeat client HeartbeatClient.ts ServiceManagementClient.ts ServiceManagementClient.java
Segment transform SegmentObjectAdapter.ts Segment.transform() TraceSegment.transform()
Meter meter/{RuntimeSampler,RuntimeMetricsCollector,MeterSender}.ts meter/{MeterSender,MeterService,...}.java
Event client EventReportServiceClient.java
Log client LogReportServiceClient.java

gRPC v1 implemented

Boot lifecycle

Module Path Notes
Boot contract src/agent/core/boot/BootService.ts Aligns with Java: prepare()boot()onComplete()shutdown(), priority() ordering
Bootstrap src/agent/core/boot/ServiceManager.ts Singleton manages all BootService; unified boot() / shutdown() / flush()
Entry src/index.ts Replaces old GrpcProtocol().heartbeat().report() with ServiceManager

Services registered by ServiceManager (hard-coded, no Java SPI):

Service Priority / condition Role
GRPCChannelManager Always Shared gRPC Channel, connectivity listener
TraceSegmentServiceClient Always Trace segment streaming
ServiceManagementClient Always Instance properties + keepAlive heartbeat (20s)
MeterSender runtimeMetricsReporterActive=true Runtime meter collect + report

gRPC Channel stack (remote package)

GRPCChannelManager
  └── GRPCChannel.newBuilder(host, port)
        ├── StandardChannelBuilder   → insecure creds + 50MB message limit
        ├── TLSChannelBuilder        → SW_AGENT_SECURE=true → createSsl()
        ├── AgentIDDecorator         → Agent-Version metadata
        └── AuthenticationDecorator  → SW_AGENT_AUTHENTICATION header
File Java counterpart v1 behavior
GRPCChannelManager.ts GRPCChannelManager.java Single-address parse; watchConnectivityState notifies listeners; reportError() debug log only
GRPCChannel.ts GRPCChannel.java Builder chain + channelOverride / interceptors
ChannelBuilder.ts ChannelBuilder.java credentials / options context
StandardChannelBuilder.ts StandardChannelBuilder.java Plaintext + message size limit
TLSChannelBuilder.ts TLSChannelBuilder.java Minimal: SW_AGENT_SECURE=truegrpc.credentials.createSsl()
ChannelDecorator.ts ChannelDecorator.java grpc-js interceptor decorator interface
AgentIDDecorator.ts AgentIDDecorator.java Agent version header
AuthenticationDecorator.ts AuthenticationDecorator.java OAP auth token
GRPCChannelListener.ts GRPCChannelListener.java CONNECTED / DISCONNECT callbacks
GRPCChannelStatus.ts (from Node Client.ts enum) Connection status enum

Remote clients (listener-driven stubs)

File Old Node name gRPC service v1 behavior
TraceSegmentServiceClient.ts TraceReportClient TraceSegmentReportService Stub on CONNECTED; buffer + periodic flush; reportError() on failure
ServiceManagementClient.ts HeartbeatClient ManagementService Stub on CONNECTED; 20s instance properties + keepAlive

Segment serialization: new Segment.transform(): SegmentObject in src/trace/context/Segment.ts; removed SegmentObjectAdapter.ts (same pattern as Java TraceSegment.transform()).

Runtime meter (shared Channel)

File Notes
RuntimeSampler.ts Single sample: memoryUsage + v8.getHeapStatistics() + cpuUsage
RuntimeMetricsCollector.ts Maps to 6 meters
MeterSender.ts BootService + GRPCChannelListener; 1s collect / 1s report (configurable); MeterReportService.collect streaming
Meter name Meaning Notes
instance_nodejs_process_cpu Process CPU % (user+system, ÷ logical cores; see README)
instance_nodejs_heap_used V8 heap used bytes
instance_nodejs_heap_total V8 heap total bytes
instance_nodejs_heap_limit V8 heap limit bytes
instance_nodejs_rss RSS bytes
instance_nodejs_external_memory External memory bytes

Configuration

New environment variables:

Variable Default Notes
SW_AGENT_NODEJS_RUNTIME_METRICS_REPORTER_ACTIVE true (explicit false disables) Enable runtime meter
SW_AGENT_NODEJS_RUNTIME_METRICS_COLLECT_PERIOD 1000 ms Sample interval (local buffer, default 1s)
SW_AGENT_NODEJS_RUNTIME_METRICS_REPORT_PERIOD 1000 ms (1s) Report interval, aligned with Java JVM runtime (1s)
SW_AGENT_NODEJS_RUNTIME_METRICS_BUFFER_SIZE 600 Buffer size

Deprecated aliases (still readable): SW_AGENT_RUNTIME_METRICS_*, SW_AGENT_NVM_METRICS_*, SW_AGENT_NVM_JVM_*.

Existing gRPC settings (semantics unchanged):

  • SW_AGENT_COLLECTOR_BACKEND_SERVICES — v1 uses first comma-separated address only
  • SW_AGENT_SECURE — enables TLS ChannelBuilder
  • SW_AGENT_AUTHENTICATION — OAP authentication

Removed / replaced Node-specific abstractions

Removed Replaced by
src/agent/protocol/grpc/GrpcProtocol.ts ServiceManager + clients
src/agent/protocol/grpc/SegmentObjectAdapter.ts Segment.transform()
src/agent/protocol/grpc/clients/* src/agent/core/remote/* (renamed + refactored)
src/agent/protocol/Protocol.ts src/agent/core/boot/BootService.ts

Entire src/agent/protocol/ directory removed.

Tests and docs

Item Status
Unit tests tests/runtime/RuntimeMetricsCollector.test.ts — 6 meter names and non-negative values
Plugin E2E tests/plugins/express/expected.data.yamlonly added meterItems (one block with server + client entries, 6 meters each); httpbin / docker-compose unchanged
CI Build / Lint / TestLib / TestPlugins, Linux Node 20/22/24 (plaintext gRPC)
README Node.js ≥ 20, runtime metrics table and env docs

TLS v1 scope

  • SW_AGENT_SECURE=trueTLSChannelBuildergrpc.credentials.createSsl() (system default trust store)
  • Suitable for public CA / enterprise PKI OAP certificates
  • CI does not cover TLS (same as Python / Go / PHP agents, plaintext collector:11800); self-signed CA / mTLS need v2 SW_AGENT_CA_PATH (NODE_EXTRA_CA_CERTS ineffective for @grpc/grpc-js)

Manual smoke test (2026-06-26, maintainer-local — not in CI):

Check Method Result
Setup Docker OAP (BanyanDB) + Node Agent; OAP gRPC TLS on host :11811
OAP SW_CORE_GRPC_SSL_ENABLED=true, certs from tls_key_generate.sh (SERVER_CN=localhost)
Agent SW_AGENT_SECURE=true, SW_AGENT_COLLECTOR_BACKEND_SERVICES=localhost:11811
TLS handshake openssl s_client -connect 127.0.0.1:11811 -servername localhost, ALPN h2
Service registration OAP GraphQL getAllServices
Instance heartbeat GraphQL getServiceInstances (Agent running ≥25s, UTC window)

Procedure: start TLS OAP → Agent (secure: true) → local HTTP request → agent.flush() → GraphQL check services and instances. Self-signed CA required loading root CA into createSsl(rootCerts) (temporary patch, not in this PR); public/enterprise CA works with SW_AGENT_SECURE=true alone. TLS verification should be covered by CI E2E in a follow-up (v2).

Out of scope (v2)

Channel failover and reconnect

Per GRPCChannelManager.ts comments: multi-address backend rotation, reportError()-driven failover, periodic Channel check, forced reconnect, periodic DNS re-resolve.

TLS / mTLS

Relative to Java TLSChannelBuilder: custom CA path, mTLS client certs, SW_AGENT_CA_PATH (or equivalent), self-signed TLS E2E CI.
Note: NODE_EXTRA_CA_CERTS is ineffective for @grpc/grpc-js; must pass CA explicitly to createSsl(ca).

Remote clients not ported

  • EventReportServiceClient (proto exists, no agent client)
  • LogReportServiceClient (proto exists, no agent client)

Skeleton / unwired code

  • GRPCStreamServiceStatus.ts: ported from Java, not referenced by Trace/Meter clients yet
  • Java ServiceLoader SPI: Node hard-codes 4 services, no plugin discovery

Test and API gaps

  • TLS / mTLS E2E CI
  • Multi-address failover tests
  • Custom business metrics public API (README suggests OpenTelemetry)
  • instance_nodejs_active_handles not in the current 6-meter set

Runtime requirements

  • Node >= 20 required; CI matrix 20/22/24

- Refactor remote reporting to agent.core: single shared gRPC channel, ServiceManager/BootService lifecycle, Builder/Decorator chain

- Add six instance_nodejs_* runtime meters via MeterReportService (1s/1s)

- Rename env vars to SW_AGENT_NODEJS_RUNTIME_METRICS_* with deprecated aliases
@wu-sheng wu-sheng added this to the 0.10.0 milestone Jun 27, 2026
@wu-sheng wu-sheng added the core label Jun 27, 2026

@wu-sheng wu-sheng left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the refactor. I like the direction of moving lifecycle and gRPC management into explicit services, but I don't think this is safe to merge yet. The main issue is that several control-flow pieces are copied from the Java agent model, while the Node agent must be stricter: no blocking waits, no stale async callbacks after shutdown, no stale clients after disconnect, and no duplicated report loops. This code runs inside the user's application event loop, so crash guards and cheap async behavior matter more than Java-style parity.

Blocking: destroy() can crash the instrumented application

GRPCChannelManager.shutdown() closes the grpc-js channel and then sets managedChannel to null, but a grpc-js connectivity watcher can still be called after the close. That callback calls watchConnectivityState() again, which dereferences managedChannel!.

// BUG: shutdown() can trigger grpc-js connectivity watchers, then the
// manager immediately removes the channel those callbacks still reference.
shutdown(): void {
  this.managedChannel?.shutdownNow();
  this.managedChannel = null;
}

// BUG: this callback can run after shutdown(). The recursive call re-enters
// getChannel(), which dereferences managedChannel!, causing an uncaught error
// in the user's process.
private watchConnectivityState(): void {
  const channel = this.getChannel();
  const currentState = channel.getConnectivityState(true);

  channel.watchConnectivityState(currentState, Infinity, (error) => {
    if (error) {
      logger.debug('Channel connectivity watch error: %s', error.message);
      return;
    }

    const ready = channel.getConnectivityState(false) === grpc.connectivityState.READY;
    this.notify(ready ? GRPCChannelStatus.CONNECTED : GRPCChannelStatus.DISCONNECT);
    this.watchConnectivityState();
  });
}

I reproduced this against this PR:

destroy returned
uncaught TypeError: Cannot read properties of null (reading 'getChannel')

Node-native suggestion: treat grpc callbacks as stale-able async lifecycle events. Keep an explicit closed flag, capture the managed channel for the watcher, and never re-watch after shutdown or after the manager has moved to a different channel.

// Node-native agent style: shutdown is idempotent, and stale grpc-js callbacks
// are ignored instead of being allowed to crash the application.
private closed = false;

shutdown(): void {
  this.closed = true;

  const managed = this.managedChannel;
  this.managedChannel = null;

  managed?.shutdownNow();
}

private watchConnectivityState(): void {
  const managed = this.managedChannel;
  if (this.closed || !managed) {
    return;
  }

  const channel = managed.getChannel();
  const currentState = channel.getConnectivityState(true);

  channel.watchConnectivityState(currentState, Infinity, (error) => {
    if (this.closed || this.managedChannel !== managed) {
      return;
    }

    if (error) {
      logger.debug('Channel connectivity watch stopped: %s', error.message);
      return;
    }

    const ready = channel.getConnectivityState(false) === grpc.connectivityState.READY;
    this.notify(ready ? GRPCChannelStatus.CONNECTED : GRPCChannelStatus.DISCONNECT);
    this.watchConnectivityState();
  });
}

High: heartbeat keeps using a stale client after disconnect

ServiceManagementClient creates the management client on CONNECTED, but never clears it on DISCONNECT. The heartbeat timer only checks whether the object exists, so once the agent has connected once, periodic management calls can continue through a stale client after disconnect. The calls also don't have per-call deadlines.

// BUG: this only checks whether a client object exists. In Node, object
// existence is not a connection-state guard for async RPC work.
this.heartbeatTimer = setInterval(() => {
  if (!this.managementServiceClient) {
    return;
  }

  // BUG: these can still be issued after DISCONNECT, and neither call has a
  // deadline. The agent should avoid producing work while disconnected.
  this.managementServiceClient.reportInstanceProperties(instanceProperties, new grpc.Metadata(), callback);
  this.managementServiceClient.keepAlive(keepAlivePkg, new grpc.Metadata(), callback);
}, 20000).unref();

// BUG: DISCONNECT does not clear the stale client or record current status.
statusChanged(status: GRPCChannelStatus): void {
  if (status === GRPCChannelStatus.CONNECTED) {
    this.managementServiceClient = this.createManagementClient();
  }
}

Node-native suggestion: keep explicit local status, clear the client on disconnect, and gate every periodic async task on the current status. Also pass deadlines to management RPCs.

// Node-native agent style: cheap local state prevents unnecessary async RPCs
// while the backend is unavailable.
private status = GRPCChannelStatus.DISCONNECT;
private managementServiceClient?: ManagementServiceClient;

statusChanged(status: GRPCChannelStatus): void {
  this.status = status;
  this.managementServiceClient =
    status === GRPCChannelStatus.CONNECTED ? this.createManagementClient() : undefined;
}

private sendHeartbeat(): void {
  if (this.status !== GRPCChannelStatus.CONNECTED || !this.managementServiceClient) {
    return;
  }

  const options = { deadline: Date.now() + config.traceTimeout };

  this.managementServiceClient.keepAlive(keepAlivePkg, new grpc.Metadata(), options, (error) => {
    if (error) {
      logHeartbeatError('Failed to send heartbeat', error);
      ServiceManager.INSTANCE.findService(GRPCChannelManager)?.reportError(error);
    }
  });
}

High: reportError() is a no-op, so clients cannot stop work quickly

Several clients call GRPCChannelManager.reportError(error) as if it updates channel state, but the implementation only logs. For a Node agent, failed network calls should quickly flip local state to DISCONNECT so timers stop doing avoidable work until grpc-js reconnects.

// BUG: callers report network errors here, but this does not notify
// DISCONNECT or otherwise affect agent behavior.
reportError(error: unknown): void {
  logger.debug('gRPC report error (multi-address failover reserved for V2): %s', error);
}

Node-native suggestion: let grpc-js handle transport reconnects, but still use reportError() as the agent's cheap local disconnect signal.

// Node-native agent style: no blocking reconnect loop is needed, but local
// services need DISCONNECT so periodic work can stop immediately.
reportError(error: unknown): void {
  if (!isGrpcNetworkError(error)) {
    return;
  }

  this.notify(GRPCChannelStatus.DISCONNECT);
}

function isGrpcNetworkError(error: unknown): boolean {
  const code = (error as grpc.ServiceError | undefined)?.code;

  return (
    code === grpc.status.UNAVAILABLE ||
    code === grpc.status.PERMISSION_DENIED ||
    code === grpc.status.UNAUTHENTICATED ||
    code === grpc.status.RESOURCE_EXHAUSTED ||
    code === grpc.status.UNKNOWN
  );
}

Medium: Java-style blocking wait was ported into Node

GRPCStreamServiceStatus appears unused, but this pattern should not land in Node agent core. If it is called later, it will burn CPU and block the user's application event loop.

// BUG: Java-thread style wait. Node has one application event loop, so this
// blocks the instrumented application, not just the agent.
wait4Finish(): void {
  while (!this.status) {
    this.try2Sleep(recheckCycle);
  }
}

private try2Sleep(millis: number): void {
  const end = Date.now() + millis;
  while (Date.now() < end) {
    // Busy wait: no async yield, no timer, event loop blocked.
  }
}

Node-native suggestion: remove this unused class. If stream completion needs to be awaited later, use the grpc-js callback or a Promise. No polling, no sleep, no busy wait.

// Node-native stream completion: grpc-js already tells us when the stream is
// done. Await that callback instead of simulating Thread.sleep().
return new Promise<void>((resolve) => {
  const stream = client.collect(metadata, { deadline }, (error) => {
    if (error) {
      logReportError('Failed to report data', error);
      channelManager.reportError(error);
    }

    resolve();
  });

  for (const item of batch) {
    stream.write(item);
  }

  stream.end();
});

Medium: flush() can create extra report timers

flush() manually calls reportFunction(), while the normal scheduled timeout can still be pending. Then reportFunction() always schedules another timeout in finally. Repeated flushes, especially in Lambda-style usage, can multiply report loops.

// BUG/RISK: manual flush runs reportFunction() without cancelling or reusing
// the already scheduled report timer.
flush(): Promise<unknown> | null {
  return this.buffer.length === 0
    ? null
    : new Promise((resolve) => {
        this.reportFunction(resolve);
      });
}

// BUG/RISK: every reportFunction() call schedules another loop, regardless of
// whether one was already scheduled before flush().
finally {
  this.timeout = setTimeout(this.reportFunction.bind(this), 1000);
  this.timeout.unref();
}

Node-native suggestion: keep one scheduled timer and one in-flight report Promise. Frequent flush() calls should reuse or force the same async report work, not create additional loops.

// Node-native agent style: one timer, one in-flight report, explicit reschedule.
private timeout?: NodeJS.Timeout;
private reporting?: Promise<void>;

private scheduleNextReport(): void {
  if (this.timeout) {
    return;
  }

  this.timeout = setTimeout(() => {
    this.timeout = undefined;
    void this.reportOnce().finally(() => this.scheduleNextReport());
  }, 1000);

  this.timeout.unref();
}

private reportOnce(): Promise<void> {
  if (this.reporting) {
    return this.reporting;
  }

  this.reporting = this.doReport().finally(() => {
    this.reporting = undefined;
  });

  return this.reporting;
}

flush(): Promise<unknown> | null {
  if (this.timeout) {
    clearTimeout(this.timeout);
    this.timeout = undefined;
  }

  return this.buffer.length === 0 ? null : this.reportOnce();
}

I did validate a few things that are OK: trace and meter buffers are bounded, report/buffer logs are throttled, the metric metadata-first-element behavior matches the protocol/other agents, and channelOverride means the generated clients do use the shared grpc-js channel. npm run build passed and npx jest tests/runtime/RuntimeMetricsCollector.test.ts --runInBand passed.

- GRPCChannelManager: closed flag and stale connectivity callback guards
- GRPCChannelManager: reportError notifies DISCONNECT on network errors
- ServiceManagementClient: track status, clear client on disconnect, RPC deadlines
- TraceSegmentServiceClient: single report timer/in-flight promise, flush safety
- MeterSender: status gating and shared in-flight report promise
- Remove unused GRPCStreamServiceStatus busy-wait class
@songzhendong

songzhendong commented Jun 27, 2026

Copy link
Copy Markdown
Contributor Author

@wu-sheng Thank you for the detailed review. Your feedback on Node-agent-specific lifecycle and async behavior was extremely valuable. I have addressed all five issues you raised; the fixes are summarized below.


Overview

This update adds 2 commits on feature/nodejs-grpc-v1-runtime-metrics:

Commit | Description -- | -- bc6c898 | Core fixes for all review items 916f457 | Follow-up: restart the trace report loop after manual flush()

Before updating this PR, I ran the full CI pipeline on my fork (Build / Test / License) — all 34 checks passed.


Fix details (5 items)

1. Blocking — shutdown() and stale connectivity callbacks

File: GRPCChannelManager.ts

  • Added an idempotent closed flag; reset on each boot().
  • shutdown(): set closed, hold the current managed channel, clear the reference, then call shutdownNow().
  • watchConnectivityState(): guard at entry (closed or no managed channel); inside the grpc-js callback, additionally verify closed and that this.managedChannel still matches the watched channel instance.

Effect: after agent shutdown, stale connectivity watchers can no longer call getChannel() on a torn-down manager.


2. High — heartbeat still using a stale client after disconnect

File: ServiceManagementClient.ts

  • Maintain local connection status; create the heartbeat stub only on CONNECTED, destroy it on DISCONNECT.
  • sendHeartbeat() runs only when status === CONNECTED and the management stub exists.
  • Both reportInstanceProperties and keepAlive now use per-call deadlines (config.traceTimeout).

3. High — reportError() was effectively a no-op

File: GRPCChannelManager.ts

  • Reworked reportError():
    • gRPC network errors (UNAVAILABLE, PERMISSION_DENIED, UNAUTHENTICATED, RESOURCE_EXHAUSTED, UNKNOWN) → notify(DISCONNECT), so periodic work stops immediately while grpc-js handles transport reconnect.
    • Non-network errors → log only; no disconnect signal.

4. Medium — Java-style busy-wait (GRPCStreamServiceStatus)

Change: removed GRPCStreamServiceStatus.ts (deleted in bc6c898).

  • Never used by Trace or Meter clients; dead code.
  • No busy-wait blocking logic remains in agent core.

5. Medium — flush() creating duplicate report loops

File: TraceSegmentServiceClient.ts

  • Single scheduled timer via scheduleNextReport() (no duplicate pending timeouts).
  • Single in-flight report via reportOnce() + reporting lock.
  • flush() clears any pending scheduled report before flushing.

Follow-up in 916f457:

  • After manual flush(), the 1s report loop is restarted:
    • empty buffer → scheduleNextReport();
    • data flushed → reportOnce().finally(() => scheduleNextReport()).

Also: MeterSender.ts — shared in-flight reporting promise so flush() and the periodic interval cannot run duplicate meter reports in parallel; reporting is gated on CONNECTED and a live stub.


I'd appreciate a re-review when convenient. Upstream CI on this PR: https://github.com/apache/skywalking-nodejs/pull/139/checks (34/34 passed). Thanks again!

@wu-sheng wu-sheng left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. The previous blocker is improved: I reran the original destroy() repro and it no longer crashes, the management client is cleared on DISCONNECT, the Java-style busy-wait helper was removed, and the trace flush loop now has a single in-flight report path. npm run build and npx jest tests/runtime/RuntimeMetricsCollector.test.ts --runInBand both pass locally.

I still see two Node lifecycle issues that should be fixed before merge.

Blocking: late gRPC callbacks can still crash after shutdown

Clearing timers and clients in shutdown() is not enough for grpc-js. RPC callbacks that were already registered can still fire after ServiceManager.shutdown() has cleared bootedServices. Those callbacks still do a required lookup with findService(GRPCChannelManager)!, so a late error callback can throw inside the user's process.

Examples:

// BUG: this callback can run after ServiceManager.shutdown() has cleared the
// registry. findService(...) then returns undefined, and `.reportError` throws.
this.managementServiceClient.reportInstanceProperties(
  this.instanceProperties!,
  new grpc.Metadata(),
  options,
  (error) => {
    if (error) {
      logHeartbeatError('Failed to send heartbeat', error);
      ServiceManager.INSTANCE.findService(GRPCChannelManager)!.reportError(error);
    }
  },
);

// Same issue for the second heartbeat RPC.
this.managementServiceClient.keepAlive(this.keepAlivePkg!, new grpc.Metadata(), options, (error) => {
  if (error) {
    logHeartbeatError('Failed to send heartbeat', error);
    ServiceManager.INSTANCE.findService(GRPCChannelManager)!.reportError(error);
  }
});
// BUG: trace stream callback can also run after shutdown/registry cleanup.
const stream = this.reporterClient.collect(
  new grpc.Metadata(),
  { deadline: Date.now() + config.traceTimeout },
  (error) => {
    if (error) {
      logReportError('Failed to report trace data', error);
      ServiceManager.INSTANCE.findService(GRPCChannelManager)!.reportError(error);
    }
    resolve();
  },
);
// BUG: meter stream callback and the sync catch block have the same required
// global lookup after async/lifecycle boundaries.
const stream = this.reporterClient.collect(
  new grpc.Metadata(),
  { deadline: Date.now() + (config.traceTimeout || 10000) },
  (error: grpc.ServiceError | null) => {
    if (error) {
      logReportError('Failed to report runtime meter data', error);
      ServiceManager.INSTANCE.findService(GRPCChannelManager)!.reportError(error);
    }
    resolve();
  },
);

I reproduced the issue with fake delayed callbacks by clearing the service registry before the callback fires:

registry cleared
Failed to send heartbeat
uncaught TypeError: Cannot read properties of undefined (reading 'reportError')

and the same pattern reproduces for trace and meter callbacks.

Node-native suggestion: don't resolve agent services through the mutable global registry inside async callbacks. Capture the GRPCChannelManager during prepare(), keep a local closed flag, and make late callbacks no-op after shutdown. In-flight grpc-js calls generally cannot be assumed cancelled just because the agent service shut down.

// Node-native agent style: callbacks can arrive late, so they must be cheap
// and safe after shutdown.
private closed = false;
private channelManager?: GRPCChannelManager;

prepare(): void {
  this.channelManager = ServiceManager.INSTANCE.findService(GRPCChannelManager);
  this.channelManager?.addChannelListener(this);
}

shutdown(): void {
  this.closed = true;
  clearInterval(this.heartbeatTimer);
  this.heartbeatTimer = undefined;
  this.managementServiceClient = undefined;
}

private reportGrpcError(error: unknown): void {
  if (this.closed) {
    return;
  }

  this.channelManager?.reportError(error);
}

// Callback uses the safe local helper, not a required global lookup.
(error) => {
  if (error) {
    logHeartbeatError('Failed to send heartbeat', error);
    this.reportGrpcError(error);
  }
}

The same pattern should be applied in TraceSegmentServiceClient and MeterSender.

High: reportError() can leave services disconnected while grpc-js is still READY

The updated reportError() now sends DISCONNECT, which is useful when the channel actually lost connectivity. But it can also be called for statuses that do not necessarily move the grpc-js channel out of READY (UNKNOWN, PERMISSION_DENIED, UNAUTHENTICATED, etc.). When that happens, listeners clear their clients, but the connectivity watcher may never fire another CONNECTED event because the channel state did not change.

// BUG/RISK: this can notify DISCONNECT even when the grpc-js channel remains
// READY. Since clients are recreated only on CONNECTED watcher events, they can
// stay undefined forever.
reportError(error: unknown): void {
  if (!isGrpcNetworkError(error)) {
    logger.debug('gRPC report error (ignored): %s', error);
    return;
  }

  logger.debug('gRPC network error, notify DISCONNECT: %s', error);
  this.notify(GRPCChannelStatus.DISCONNECT);
}

I simulated this state and got:

events CONNECTED,DISCONNECT
manager still sees connected true

Node-native suggestion: use grpc-js connectivity as the source of truth for local connection state, or make reportError() include a re-notification/recheck path. For example, avoid clearing clients if the shared channel is still READY, or schedule an unref'ed connectivity recheck that sends CONNECTED when managedChannel.isConnected(true) is already true.

// Sketch: avoid a permanent local DISCONNECT when the transport is still ready.
reportError(error: unknown): void {
  if (!isGrpcNetworkError(error)) {
    return;
  }

  const managed = this.managedChannel;
  if (!managed) {
    this.notify(GRPCChannelStatus.DISCONNECT);
    return;
  }

  if (managed.isConnected(false)) {
    this.notify(GRPCChannelStatus.CONNECTED);
    return;
  }

  this.notify(GRPCChannelStatus.DISCONNECT);
}

The exact policy can differ, but the important bit is that reportError() should not create a local disconnected state without a path back to connected when the underlying grpc-js channel never changed state.

Non-blocking cost note: the management client still sends both reportInstanceProperties and keepAlive on every heartbeat. That existed before this update, but for a Node agent it would be cheaper to report static instance properties less frequently and send only keepAlive on most ticks, like the Java agent does.

songzhendong added a commit to songzhendong/skywalking-nodejs that referenced this pull request Jun 27, 2026
…Error recovery

- Capture GRPCChannelManager in prepare(); reportGrpcError() no-ops after shutdown
- Apply pattern in ServiceManagementClient, TraceSegmentServiceClient, MeterSender
- reportError(): re-notify CONNECTED when grpc-js channel stays ready
- Guard scheduleNextReport, flush, heartbeat, and meter timers with closed flag

Signed-off-by: songzhendong <289505773@qq.com>
songzhendong added a commit to songzhendong/skywalking-nodejs that referenced this pull request Jun 27, 2026
…Error recovery

- Capture GRPCChannelManager in prepare(); reportGrpcError() no-ops after shutdown
- Apply pattern in ServiceManagementClient, TraceSegmentServiceClient, MeterSender
- reportError(): re-notify CONNECTED when grpc-js channel stays ready
- Guard scheduleNextReport, flush, heartbeat, and meter timers with closed flag
- Trace stream: always end in finally; EventEmitter off-before-on and closed guard
- Clear GRPCChannelManager listeners on shutdown; replace channelManager! with guards

Signed-off-by: songzhendong <289505773@qq.com>
songzhendong added a commit to songzhendong/skywalking-nodejs that referenced this pull request Jun 27, 2026
…Error recovery

- Capture GRPCChannelManager in prepare(); reportGrpcError() no-ops after shutdown
- Apply pattern in ServiceManagementClient, TraceSegmentServiceClient, MeterSender
- reportError(): re-notify CONNECTED when grpc-js channel stays ready
- Guard scheduleNextReport, flush, heartbeat, and meter timers with closed flag
- Trace stream: always end in finally; EventEmitter off-before-on and closed guard
- Clear GRPCChannelManager listeners on shutdown; replace channelManager! with guards

Signed-off-by: songzhendong <289505773@qq.com>
songzhendong added a commit to songzhendong/skywalking-nodejs that referenced this pull request Jun 27, 2026
…Error recovery

- Capture GRPCChannelManager in prepare(); reportGrpcError() no-ops after shutdown
- Apply pattern in ServiceManagementClient, TraceSegmentServiceClient, MeterSender
- reportError(): re-notify CONNECTED when grpc-js channel stays ready
- Guard scheduleNextReport, flush, heartbeat, and meter timers with closed flag
- Trace stream: always end in finally; EventEmitter off-before-on and closed guard
- Clear GRPCChannelManager listeners on shutdown; replace channelManager! with guards

Signed-off-by: songzhendong <289505773@qq.com>
songzhendong added a commit to songzhendong/skywalking-nodejs that referenced this pull request Jun 27, 2026
…Error recovery

- Capture GRPCChannelManager in prepare(); reportGrpcError() no-ops after shutdown
- Apply pattern in ServiceManagementClient, TraceSegmentServiceClient, MeterSender
- reportError(): re-notify CONNECTED when grpc-js channel stays ready
- Guard scheduleNextReport, flush, heartbeat, and meter timers with closed flag
- Trace stream: always end in finally; EventEmitter off-before-on and closed guard
- Clear GRPCChannelManager listeners on shutdown; replace channelManager! with guards

Signed-off-by: songzhendong <289505773@qq.com>
…Error recovery

- Capture GRPCChannelManager in prepare(); reportGrpcError() no-ops after shutdown
- Apply pattern in ServiceManagementClient, TraceSegmentServiceClient, MeterSender
- reportError(): re-notify CONNECTED when grpc-js channel stays ready
- Guard scheduleNextReport, flush, heartbeat, and meter timers with closed flag
- Trace stream: always end in finally; EventEmitter off-before-on and closed guard
- Clear GRPCChannelManager listeners on shutdown; replace channelManager! with guards

Signed-off-by: songzhendong <289505773@qq.com>
@songzhendong

Copy link
Copy Markdown
Contributor Author

Thank you for taking the time to review the code.
Round-2 fixes are in one commit on feature/nodejs-grpc-v1-runtime-metrics:

Round-2 (blocking / high)

  1. Late RPC callbacks after shutdownGRPCChannelManager cached in prepare(); async paths use reportGrpcError() (no findService(GRPCChannelManager)! in callbacks).
  2. False DISCONNECT on reportError() — if grpc-js channel is still connected, re-notify CONNECTED; otherwise DISCONNECT.

Same commit (self-review hardening + alignment)

Item Change
shutdown() notify(DISCONNECT) before clearing listeners
Lifecycle closed guards on timers / flush / report paths
Trace stream stream.end() try/catch; EventEmitter off before on
Meter stream stream.end() try/catch
Heartbeat Java-style alternation: counter % 10 == 0reportInstanceProperties, else keepAlive only (tick still 20s)
Polish optional listener typing; timer unref() style

@wu-sheng

Copy link
Copy Markdown
Member

Thanks for the update. I rechecked the previous lifecycle issues against d67e866f44b550d4eacd1a304032ea09ad142ba0.

The prior blockers are fixed in my local probes:

  • Original agent.destroy() grpc-js watcher crash no longer reproduces.
  • Late heartbeat, trace, and meter callbacks after shutdown no longer crash when the service registry has been cleared.
  • reportError() no longer leaves local services stuck in DISCONNECT while the shared channel still reports ready.
  • npm run build passes.
  • npx jest tests/runtime/RuntimeMetricsCollector.test.ts --runInBand passes.

I found one remaining compatibility/cost-control bug in config handling.

Deprecated programmatic runtime metric options are declared but ignored

The new config type declares deprecated programmatic aliases such as nvmMetricsReporterActive, but agent.start({ nvmMetricsReporterActive: false }) does not disable runtime metrics. The canonical field already has a default value before finalizeConfig() runs, so applyDeprecatedRuntimeMetricConfig() never copies the deprecated option.

// index.ts
// BUG: options are merged into the already-defaulted singleton config.
// runtimeMetricsReporterActive is already true by default at this point.
Object.assign(config, options);
finalizeConfig(config);
// AgentConfig.ts
// BUG: this only copies the deprecated alias when the canonical field is
// undefined. In the real singleton config it is already defaulted, so
// nvmMetricsReporterActive: false is ignored.
function applyDeprecatedRuntimeMetricConfig(config: AgentConfig): void {
  if (config.runtimeMetricsReporterActive === undefined) {
    if (config.nvmMetricsReporterActive !== undefined) {
      config.runtimeMetricsReporterActive = config.nvmMetricsReporterActive;
    } else if (config.nvmJvmReporterActive !== undefined) {
      config.runtimeMetricsReporterActive = config.nvmJvmReporterActive;
    }
  }
}

I verified this locally:

agent.start({ nvmMetricsReporterActive: false })
runtimeMetricsReporterActive after start true

Env aliases are handled earlier in the default config construction, so this is specifically the programmatic AgentConfig path. For an agent, this matters because runtime metrics create extra timers and collection/report work; a documented/deprecated disable option should not silently fail.

Suggested Node-side fix: normalize deprecated aliases on the incoming options object before merging it into the default singleton, so explicit user options win over defaults.

// Sketch: normalize user-provided options before Object.assign(config, options).
function normalizeDeprecatedRuntimeMetricOptions(options: AgentConfig): AgentConfig {
  if (options.runtimeMetricsReporterActive === undefined) {
    options.runtimeMetricsReporterActive =
      options.nvmMetricsReporterActive ?? options.nvmJvmReporterActive;
  }

  if (options.runtimeMetricsCollectPeriod === undefined) {
    options.runtimeMetricsCollectPeriod =
      options.nvmMetricsCollectPeriod ?? options.nvmJvmMetricsCollectPeriod;
  }

  if (options.runtimeMetricsReportPeriod === undefined) {
    options.runtimeMetricsReportPeriod =
      options.nvmMetricsReportPeriod ?? options.nvmJvmMetricsReportPeriod;
  }

  if (options.runtimeMetricsBufferSize === undefined) {
    options.runtimeMetricsBufferSize =
      options.nvmMetricsBufferSize ?? options.nvmJvmMetricsBufferSize;
  }

  return options;
}

Please also add a small config test for this path:

// Expected behavior: deprecated programmatic disable should still disable
// MeterSender registration unless the canonical option is explicitly set.
agent.start({ nvmMetricsReporterActive: false });
expect(config.runtimeMetricsReporterActive).toBe(false);

@songzhendong songzhendong force-pushed the feature/nodejs-grpc-v1-runtime-metrics branch from 05851b5 to d67e866 Compare June 28, 2026 00:22
songzhendong added a commit to songzhendong/skywalking-nodejs that referenced this pull request Jun 28, 2026
Normalize programmatic deprecated aliases before merging user options into
the singleton config without overwriting defaults with undefined, apply
deprecated fields during finalizeConfig for direct config mutation, and
add unit plus integration coverage for alias mapping, MeterSender
registration, and real ServiceManager loadServices. Run config tests in CI.

Signed-off-by: songzhendong <289505773@qq.com>
songzhendong added a commit to songzhendong/skywalking-nodejs that referenced this pull request Jun 28, 2026
Normalize programmatic deprecated aliases before merging user options into
the singleton config without overwriting defaults with undefined, apply
deprecated fields during finalizeConfig for direct config mutation, and
add unit plus integration coverage for alias mapping and MeterSender
registration through real ServiceManager loadServices.

Signed-off-by: songzhendong <289505773@qq.com>
songzhendong added a commit to songzhendong/skywalking-nodejs that referenced this pull request Jun 28, 2026
Normalize programmatic deprecated aliases before merging user options into
the singleton config without overwriting defaults with undefined, apply
deprecated fields during finalizeConfig for direct config mutation, and
add unit plus integration coverage for alias mapping and MeterSender
registration through real ServiceManager loadServices.

Fork-only: run config tests in TestLib via npx jest tests/config/.

Signed-off-by: songzhendong <289505773@qq.com>
songzhendong added a commit to songzhendong/skywalking-nodejs that referenced this pull request Jun 28, 2026
Normalize programmatic deprecated aliases before merging user options into
the singleton config without overwriting defaults with undefined, apply
deprecated fields during finalizeConfig for direct config mutation, and
add unit tests for alias mapping and agent.start() disable path.

Signed-off-by: songzhendong <289505773@qq.com>
songzhendong added a commit to songzhendong/skywalking-nodejs that referenced this pull request Jun 28, 2026
Normalize programmatic deprecated aliases before merging user options into
the singleton config without overwriting defaults with undefined, apply
deprecated fields during finalizeConfig for direct config mutation, and
add unit tests for alias mapping and agent.start() disable path.

Fork-only: run config tests in TestLib via npx jest tests/config/.

Signed-off-by: songzhendong <289505773@qq.com>
Normalize programmatic deprecated aliases before merging user options into
the singleton config without overwriting defaults with undefined, apply
deprecated fields during finalizeConfig for direct config mutation, and
add unit tests for alias mapping and agent.start() disable path.

Signed-off-by: songzhendong <289505773@qq.com>
@songzhendong

Copy link
Copy Markdown
Contributor Author

Thanks for the Round-3 review
The deprecated programmatic runtime metric config issue is fixed

Fix

  • Normalize deprecated aliases in normalizeDeprecatedRuntimeMetricOptions() before merging user options into the singleton config, so values like nvmMetricsReporterActive: false override the default runtimeMetricsReporterActive: true.
  • Avoid writing undefined into canonical fields when agent.start({}) is called with empty options.
  • Keep applyDeprecatedRuntimeMetricConfig() for deprecated fields set directly on the exported config before agent.start().

Test

Added tests/config/AgentConfig.test.ts, including the path you suggested:

agent.start({ nvmMetricsReporterActive: false });
expect(config.runtimeMetricsReporterActive).toBe(false);
Also covers canonical-over-deprecated priority and programmatic nvmJvm* aliases.

Verification
Local:

npm run build
npx jest tests/config/AgentConfig.test.ts --runInBand
Fork CI (Build / License / Test) is green

@wu-sheng

Copy link
Copy Markdown
Member

Thanks for the update. The previous single-start alias issue is fixed: agent.start({ nvmMetricsReporterActive: false }) now sets runtimeMetricsReporterActive to false, and the new config test covers that.

I found one remaining stale-config edge around restart. normalizeDeprecatedRuntimeMetricOptions() still returns the deprecated alias keys when it promotes them to canonical keys. After the first start, those alias keys stay on the singleton exported config, so they can override a later canonical option during a destroy() / start() cycle.

// First start: deprecated alias is promoted, but nvmMetricsReporterActive
// remains on the singleton config.
agent.start({
  nvmMetricsReporterActive: false,
});

agent.destroy();

// BUG: this canonical option should win on the second start, but the stale
// config.nvmMetricsReporterActive=false from the first start is still present.
agent.start({
  runtimeMetricsReporterActive: true,
});

I verified locally:

first start runtimeMetricsReporterActive false stale alias false
second start runtimeMetricsReporterActive false stale alias false

The cause is this combination:

// normalizeDeprecatedRuntimeMetricOptions() promotes the alias...
if (reporterActive !== undefined) {
  normalized.runtimeMetricsReporterActive = reporterActive;
}

// ...but keeps normalized.nvmMetricsReporterActive on the object.
Object.assign(config, normalizeDeprecatedRuntimeMetricOptions(options));

// Later finalizeConfig() sees the stale alias and overwrites the canonical value.
if (config.nvmMetricsReporterActive !== undefined) {
  config.runtimeMetricsReporterActive = config.nvmMetricsReporterActive;
}

Suggestion: after promoting deprecated aliases, delete the deprecated fields from the normalized options and from the singleton config after applyDeprecatedRuntimeMetricConfig() consumes them. Add a regression test for:

agent.start({ nvmMetricsReporterActive: false });
agent.destroy();
agent.start({ runtimeMetricsReporterActive: true });
expect(config.runtimeMetricsReporterActive).toBe(true);

…ic aliases.

Strip deprecated runtime metric fields after alias promotion in
normalizeDeprecatedRuntimeMetricOptions(), respect explicit canonical
options on restart in applyDeprecatedRuntimeMetricConfig(), and add
regression tests for destroy/start cycles.

Signed-off-by: songzhendong <289505773@qq.com>
@songzhendong

Copy link
Copy Markdown
Contributor Author

Thanks for the Round-4 review.
The stale deprecated alias on destroy() / start() restart is fixed.

Fix

  • After promoting deprecated aliases in normalizeDeprecatedRuntimeMetricOptions(), delete the deprecated fields so they are not merged onto the singleton config.
  • In applyDeprecatedRuntimeMetricConfig(), only apply deprecated aliases when the current agent.start(options) call did not explicitly set the canonical field; then clear deprecated fields from the singleton.
  • Pass options into finalizeConfig(config, options) from agent.start().

Test

Added a regression test in tests/config/AgentConfig.test.ts:

agent.start({ nvmMetricsReporterActive: false });
agent.destroy();
agent.start({ runtimeMetricsReporterActive: true });
expect(config.runtimeMetricsReporterActive).toBe(true);
Also verifies normalized options do not retain stale alias keys.

Verification
Local:

npm run build
npx jest tests/config/AgentConfig.test.ts --runInBand
Fork CI (Build / License / Test) is green 

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the Node.js agent’s remote reporting stack to a Java agent.core-aligned gRPC v1 architecture (single shared channel + unified BootService lifecycle), and adds periodic Node.js runtime meters reported via MeterReportService.

Changes:

  • Replace the old GrpcProtocol / per-client channel approach with ServiceManager + GRPCChannelManager + Java-aligned remote clients.
  • Add six instance_nodejs_* runtime meters with configurable collect/report periods and buffering, including deprecated-config/env alias normalization.
  • Extend tests/docs to cover runtime meter mapping and E2E meterItems assertions.

Reviewed changes

Copilot reviewed 28 out of 28 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
README.md Documents Node.js runtime meters and new env vars (with legacy aliases).
src/agent/core/boot/BootService.ts Replaces the old protocol contract with a Java-style BootService lifecycle interface.
src/agent/core/boot/ServiceManager.ts Adds a central boot/flush/shutdown orchestrator for core services.
src/agent/core/meter/MeterSender.ts Adds runtime meter buffering and gRPC reporting via MeterReportService.
src/agent/core/meter/RuntimeMetricsCollector.ts Maps sampled runtime data into six instance_nodejs_* meters.
src/agent/core/meter/RuntimeSampler.ts Samples CPU + memory runtime snapshot from Node/V8 APIs.
src/agent/core/remote/AgentIDDecorator.ts Adds Agent-Version metadata via a channel interceptor.
src/agent/core/remote/AuthenticationDecorator.ts Adds Authentication metadata via a channel interceptor.
src/agent/core/remote/ChannelBuilder.ts Introduces a builder-chain context for channel credentials/options.
src/agent/core/remote/ChannelBuilder.ts Introduces a builder-chain context for channel credentials/options.
src/agent/core/remote/ChannelDecorator.ts Replaces old metadata helper with a channel decorator (interceptor) interface.
src/agent/core/remote/GRPCChannel.ts Implements a shared channel with builder/decorator chaining and client override options.
src/agent/core/remote/GRPCChannelListener.ts Adds a listener contract for connectivity status changes.
src/agent/core/remote/GRPCChannelManager.ts Adds the shared channel manager and connectivity watching/notification.
src/agent/core/remote/GRPCChannelStatus.ts Replaces the old Client interface with a simple CONNECTED/DISCONNECT status enum.
src/agent/core/remote/ServiceManagementClient.ts Refactors heartbeat/instance properties into a BootService driven by channel status.
src/agent/core/remote/StandardChannelBuilder.ts Adds standard (plaintext) channel options, including message size limits.
src/agent/core/remote/TLSChannelBuilder.ts Adds TLS credential upgrade when SW_AGENT_SECURE=true.
src/agent/core/remote/TraceSegmentServiceClient.ts Refactors trace segment streaming into a BootService driven by channel status.
src/agent/protocol/grpc/GrpcProtocol.ts Removes the old protocol wrapper (replaced by ServiceManager-based bootstrap).
src/agent/protocol/grpc/SegmentObjectAdapter.ts Removes the adapter in favor of Segment.transform().
src/agent/protocol/grpc/clients/HeartbeatClient.ts Removes the old heartbeat client (replaced by ServiceManagementClient).
src/agent/protocol/grpc/clients/TraceReportClient.ts Removes the old trace client (replaced by TraceSegmentServiceClient).
src/config/AgentConfig.ts Adds runtime-metrics config/env parsing + deprecated alias normalization.
src/index.ts Switches agent startup/shutdown/flush from GrpcProtocol to ServiceManager; normalizes deprecated runtime-metrics options.
src/trace/context/Segment.ts Adds Segment.transform() to serialize segments for gRPC reporting.
tests/config/AgentConfig.test.ts Adds unit tests for deprecated runtime-metrics alias normalization behavior.
tests/plugins/express/expected.data.yaml Adds E2E validation expectations for runtime meterItems (server/client).
tests/runtime/RuntimeMetricsCollector.test.ts Adds unit test for meter name mapping and non-negative values.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +160 to +163
const channel = managed.getChannel();
const currentState = channel.getConnectivityState(true);

channel.watchConnectivityState(currentState, Infinity, (error) => {
Comment on lines +111 to +115
return new TraceSegmentReportServiceClient(
config.collectorAddress,
grpc.credentials.createInsecure(),
this.channelManager.getClientOptions(),
);
Comment on lines +146 to +150
return new ManagementServiceClient(
config.collectorAddress,
grpc.credentials.createInsecure(),
this.channelManager.getClientOptions(),
);
Comment on lines +79 to +83
return new MeterReportServiceClient(
config.collectorAddress,
grpc.credentials.createInsecure(),
this.channelManager.getClientOptions(),
);
Comment on lines +159 to +172
let metadataWritten = false;
const timestamp = Date.now();
for (const snapshot of snapshots) {
for (const meterData of this.collector.toMeterData(snapshot)) {
if (!metadataWritten) {
meterData
.setService(config.serviceName)
.setServiceinstance(config.serviceInstance)
.setTimestamp(timestamp);
metadataWritten = true;
}
stream.write(meterData);
}
}
When the shared channel is already READY before the first state transition,
listeners now receive CONNECTED immediately instead of staying DISCONNECT
until a later connectivity change.

Add unit regression test for the already-READY boot path.
@songzhendong

Copy link
Copy Markdown
Contributor Author

Fixed in this update

GRPCChannelManager boot-time connectivity: notify listeners of the current channel state in boot() (via notifyCurrentConnectivityState()), so CONNECTED is delivered when the shared channel is already READY at startup.

Planned for a follow-up (later)

Multi-address / stub target consistency: align client stub targets with GRPCChannelManager.resolveAddress() together with V2 work (multi-backend failover is out of V1 scope; DNS re-resolve is the higher-priority).

No change planned for now

  1. watchConnectivityState(..., Infinity, ...): Infinity means no watch deadline in @grpc/grpc-js; a finite deadline would stop the watcher on timeout.
  2. Meter metadata on every MeterData: Meter.proto specifies service / serviceInstance / timestamp on the first stream element only; the Java MeterSender does the same (hasSendMachineInfo). If we set them on every message instead: larger payloads with no OAP benefit, diverges from the Java agent and proto contract

Test

Added tests/remote/GRPCChannelManager.test.ts:

manager.addChannelListener(listener);
manager.boot();
expect(listener.statusChanged).toHaveBeenCalledWith(GRPCChannelStatus.CONNECTED);

npm run build
npx jest tests/remote/GRPCChannelManager.test.ts --runInBand

@wu-sheng wu-sheng changed the title Add Node.js runtime metrics and align gRPC with Java agent.core Refactor gRPC runtime services and add Node.js runtime metrics Jun 28, 2026
@wu-sheng wu-sheng merged commit 36df516 into apache:master Jun 28, 2026
34 checks passed
songzhendong added a commit to songzhendong/skywalking that referenced this pull request Jun 28, 2026
Update e2e Node.js agent pin after merge of apache/skywalking-nodejs#139
(Refactor gRPC runtime services and add Node.js runtime metrics).
songzhendong added a commit to songzhendong/skywalking that referenced this pull request Jun 28, 2026
Add nodejs-runtime MAL rules, documentation, e2e assertions, and pin
SW_AGENT_NODEJS_COMMIT to apache/skywalking-nodejs master @ 36df516
(merged apache/skywalking-nodejs#139).
songzhendong added a commit to songzhendong/skywalking that referenced this pull request Jun 28, 2026
Add nodejs-runtime MAL rules, documentation, e2e assertions, and pin
SW_AGENT_NODEJS_COMMIT to apache/skywalking-nodejs master @ 36df516
(merged apache/skywalking-nodejs#139).
songzhendong added a commit to songzhendong/skywalking that referenced this pull request Jun 29, 2026
Add nodejs-runtime MAL rules, documentation, e2e assertions, and pin
SW_AGENT_NODEJS_COMMIT to apache/skywalking-nodejs master @ 36df516
(merged apache/skywalking-nodejs#139).
songzhendong added a commit to songzhendong/skywalking that referenced this pull request Jun 29, 2026
Add nodejs-runtime MAL rules, documentation, e2e assertions, and pin
SW_AGENT_NODEJS_COMMIT to apache/skywalking-nodejs master @ 36df516
(merged apache/skywalking-nodejs#139).
songzhendong added a commit to songzhendong/skywalking that referenced this pull request Jun 29, 2026
Add nodejs-runtime MAL rules, documentation, e2e assertions, and pin
SW_AGENT_NODEJS_COMMIT to apache/skywalking-nodejs master @ 36df516
(merged apache/skywalking-nodejs#139).
wu-sheng pushed a commit to apache/skywalking that referenced this pull request Jun 30, 2026
Add nodejs-runtime MAL rules, documentation, e2e assertions, and pin
SW_AGENT_NODEJS_COMMIT to apache/skywalking-nodejs master @ 36df516
(merged apache/skywalking-nodejs#139).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants