Refactor gRPC runtime services and add Node.js runtime metrics#139
Conversation
- Refactor remote reporting to agent.core: single shared gRPC channel, ServiceManager/BootService lifecycle, Builder/Decorator chain - Add six instance_nodejs_* runtime meters via MeterReportService (1s/1s) - Rename env vars to SW_AGENT_NODEJS_RUNTIME_METRICS_* with deprecated aliases
wu-sheng
left a comment
There was a problem hiding this comment.
Thanks for the refactor. I like the direction of moving lifecycle and gRPC management into explicit services, but I don't think this is safe to merge yet. The main issue is that several control-flow pieces are copied from the Java agent model, while the Node agent must be stricter: no blocking waits, no stale async callbacks after shutdown, no stale clients after disconnect, and no duplicated report loops. This code runs inside the user's application event loop, so crash guards and cheap async behavior matter more than Java-style parity.
Blocking: destroy() can crash the instrumented application
GRPCChannelManager.shutdown() closes the grpc-js channel and then sets managedChannel to null, but a grpc-js connectivity watcher can still be called after the close. That callback calls watchConnectivityState() again, which dereferences managedChannel!.
// BUG: shutdown() can trigger grpc-js connectivity watchers, then the
// manager immediately removes the channel those callbacks still reference.
shutdown(): void {
this.managedChannel?.shutdownNow();
this.managedChannel = null;
}
// BUG: this callback can run after shutdown(). The recursive call re-enters
// getChannel(), which dereferences managedChannel!, causing an uncaught error
// in the user's process.
private watchConnectivityState(): void {
const channel = this.getChannel();
const currentState = channel.getConnectivityState(true);
channel.watchConnectivityState(currentState, Infinity, (error) => {
if (error) {
logger.debug('Channel connectivity watch error: %s', error.message);
return;
}
const ready = channel.getConnectivityState(false) === grpc.connectivityState.READY;
this.notify(ready ? GRPCChannelStatus.CONNECTED : GRPCChannelStatus.DISCONNECT);
this.watchConnectivityState();
});
}I reproduced this against this PR:
destroy returned
uncaught TypeError: Cannot read properties of null (reading 'getChannel')
Node-native suggestion: treat grpc callbacks as stale-able async lifecycle events. Keep an explicit closed flag, capture the managed channel for the watcher, and never re-watch after shutdown or after the manager has moved to a different channel.
// Node-native agent style: shutdown is idempotent, and stale grpc-js callbacks
// are ignored instead of being allowed to crash the application.
private closed = false;
shutdown(): void {
this.closed = true;
const managed = this.managedChannel;
this.managedChannel = null;
managed?.shutdownNow();
}
private watchConnectivityState(): void {
const managed = this.managedChannel;
if (this.closed || !managed) {
return;
}
const channel = managed.getChannel();
const currentState = channel.getConnectivityState(true);
channel.watchConnectivityState(currentState, Infinity, (error) => {
if (this.closed || this.managedChannel !== managed) {
return;
}
if (error) {
logger.debug('Channel connectivity watch stopped: %s', error.message);
return;
}
const ready = channel.getConnectivityState(false) === grpc.connectivityState.READY;
this.notify(ready ? GRPCChannelStatus.CONNECTED : GRPCChannelStatus.DISCONNECT);
this.watchConnectivityState();
});
}High: heartbeat keeps using a stale client after disconnect
ServiceManagementClient creates the management client on CONNECTED, but never clears it on DISCONNECT. The heartbeat timer only checks whether the object exists, so once the agent has connected once, periodic management calls can continue through a stale client after disconnect. The calls also don't have per-call deadlines.
// BUG: this only checks whether a client object exists. In Node, object
// existence is not a connection-state guard for async RPC work.
this.heartbeatTimer = setInterval(() => {
if (!this.managementServiceClient) {
return;
}
// BUG: these can still be issued after DISCONNECT, and neither call has a
// deadline. The agent should avoid producing work while disconnected.
this.managementServiceClient.reportInstanceProperties(instanceProperties, new grpc.Metadata(), callback);
this.managementServiceClient.keepAlive(keepAlivePkg, new grpc.Metadata(), callback);
}, 20000).unref();
// BUG: DISCONNECT does not clear the stale client or record current status.
statusChanged(status: GRPCChannelStatus): void {
if (status === GRPCChannelStatus.CONNECTED) {
this.managementServiceClient = this.createManagementClient();
}
}Node-native suggestion: keep explicit local status, clear the client on disconnect, and gate every periodic async task on the current status. Also pass deadlines to management RPCs.
// Node-native agent style: cheap local state prevents unnecessary async RPCs
// while the backend is unavailable.
private status = GRPCChannelStatus.DISCONNECT;
private managementServiceClient?: ManagementServiceClient;
statusChanged(status: GRPCChannelStatus): void {
this.status = status;
this.managementServiceClient =
status === GRPCChannelStatus.CONNECTED ? this.createManagementClient() : undefined;
}
private sendHeartbeat(): void {
if (this.status !== GRPCChannelStatus.CONNECTED || !this.managementServiceClient) {
return;
}
const options = { deadline: Date.now() + config.traceTimeout };
this.managementServiceClient.keepAlive(keepAlivePkg, new grpc.Metadata(), options, (error) => {
if (error) {
logHeartbeatError('Failed to send heartbeat', error);
ServiceManager.INSTANCE.findService(GRPCChannelManager)?.reportError(error);
}
});
}High: reportError() is a no-op, so clients cannot stop work quickly
Several clients call GRPCChannelManager.reportError(error) as if it updates channel state, but the implementation only logs. For a Node agent, failed network calls should quickly flip local state to DISCONNECT so timers stop doing avoidable work until grpc-js reconnects.
// BUG: callers report network errors here, but this does not notify
// DISCONNECT or otherwise affect agent behavior.
reportError(error: unknown): void {
logger.debug('gRPC report error (multi-address failover reserved for V2): %s', error);
}Node-native suggestion: let grpc-js handle transport reconnects, but still use reportError() as the agent's cheap local disconnect signal.
// Node-native agent style: no blocking reconnect loop is needed, but local
// services need DISCONNECT so periodic work can stop immediately.
reportError(error: unknown): void {
if (!isGrpcNetworkError(error)) {
return;
}
this.notify(GRPCChannelStatus.DISCONNECT);
}
function isGrpcNetworkError(error: unknown): boolean {
const code = (error as grpc.ServiceError | undefined)?.code;
return (
code === grpc.status.UNAVAILABLE ||
code === grpc.status.PERMISSION_DENIED ||
code === grpc.status.UNAUTHENTICATED ||
code === grpc.status.RESOURCE_EXHAUSTED ||
code === grpc.status.UNKNOWN
);
}Medium: Java-style blocking wait was ported into Node
GRPCStreamServiceStatus appears unused, but this pattern should not land in Node agent core. If it is called later, it will burn CPU and block the user's application event loop.
// BUG: Java-thread style wait. Node has one application event loop, so this
// blocks the instrumented application, not just the agent.
wait4Finish(): void {
while (!this.status) {
this.try2Sleep(recheckCycle);
}
}
private try2Sleep(millis: number): void {
const end = Date.now() + millis;
while (Date.now() < end) {
// Busy wait: no async yield, no timer, event loop blocked.
}
}Node-native suggestion: remove this unused class. If stream completion needs to be awaited later, use the grpc-js callback or a Promise. No polling, no sleep, no busy wait.
// Node-native stream completion: grpc-js already tells us when the stream is
// done. Await that callback instead of simulating Thread.sleep().
return new Promise<void>((resolve) => {
const stream = client.collect(metadata, { deadline }, (error) => {
if (error) {
logReportError('Failed to report data', error);
channelManager.reportError(error);
}
resolve();
});
for (const item of batch) {
stream.write(item);
}
stream.end();
});Medium: flush() can create extra report timers
flush() manually calls reportFunction(), while the normal scheduled timeout can still be pending. Then reportFunction() always schedules another timeout in finally. Repeated flushes, especially in Lambda-style usage, can multiply report loops.
// BUG/RISK: manual flush runs reportFunction() without cancelling or reusing
// the already scheduled report timer.
flush(): Promise<unknown> | null {
return this.buffer.length === 0
? null
: new Promise((resolve) => {
this.reportFunction(resolve);
});
}
// BUG/RISK: every reportFunction() call schedules another loop, regardless of
// whether one was already scheduled before flush().
finally {
this.timeout = setTimeout(this.reportFunction.bind(this), 1000);
this.timeout.unref();
}Node-native suggestion: keep one scheduled timer and one in-flight report Promise. Frequent flush() calls should reuse or force the same async report work, not create additional loops.
// Node-native agent style: one timer, one in-flight report, explicit reschedule.
private timeout?: NodeJS.Timeout;
private reporting?: Promise<void>;
private scheduleNextReport(): void {
if (this.timeout) {
return;
}
this.timeout = setTimeout(() => {
this.timeout = undefined;
void this.reportOnce().finally(() => this.scheduleNextReport());
}, 1000);
this.timeout.unref();
}
private reportOnce(): Promise<void> {
if (this.reporting) {
return this.reporting;
}
this.reporting = this.doReport().finally(() => {
this.reporting = undefined;
});
return this.reporting;
}
flush(): Promise<unknown> | null {
if (this.timeout) {
clearTimeout(this.timeout);
this.timeout = undefined;
}
return this.buffer.length === 0 ? null : this.reportOnce();
}I did validate a few things that are OK: trace and meter buffers are bounded, report/buffer logs are throttled, the metric metadata-first-element behavior matches the protocol/other agents, and channelOverride means the generated clients do use the shared grpc-js channel. npm run build passed and npx jest tests/runtime/RuntimeMetricsCollector.test.ts --runInBand passed.
- GRPCChannelManager: closed flag and stale connectivity callback guards - GRPCChannelManager: reportError notifies DISCONNECT on network errors - ServiceManagementClient: track status, clear client on disconnect, RPC deadlines - TraceSegmentServiceClient: single report timer/in-flight promise, flush safety - MeterSender: status gating and shared in-flight report promise - Remove unused GRPCStreamServiceStatus busy-wait class
|
@wu-sheng Thank you for the detailed review. Your feedback on Node-agent-specific lifecycle and async behavior was extremely valuable. I have addressed all five issues you raised; the fixes are summarized below. OverviewThis update adds 2 commits on Before updating this PR, I ran the full CI pipeline on my fork (Build / Test / License) — all 34 checks passed. Fix details (5 items)1. Blocking — |
wu-sheng
left a comment
There was a problem hiding this comment.
Thanks for the update. The previous blocker is improved: I reran the original destroy() repro and it no longer crashes, the management client is cleared on DISCONNECT, the Java-style busy-wait helper was removed, and the trace flush loop now has a single in-flight report path. npm run build and npx jest tests/runtime/RuntimeMetricsCollector.test.ts --runInBand both pass locally.
I still see two Node lifecycle issues that should be fixed before merge.
Blocking: late gRPC callbacks can still crash after shutdown
Clearing timers and clients in shutdown() is not enough for grpc-js. RPC callbacks that were already registered can still fire after ServiceManager.shutdown() has cleared bootedServices. Those callbacks still do a required lookup with findService(GRPCChannelManager)!, so a late error callback can throw inside the user's process.
Examples:
// BUG: this callback can run after ServiceManager.shutdown() has cleared the
// registry. findService(...) then returns undefined, and `.reportError` throws.
this.managementServiceClient.reportInstanceProperties(
this.instanceProperties!,
new grpc.Metadata(),
options,
(error) => {
if (error) {
logHeartbeatError('Failed to send heartbeat', error);
ServiceManager.INSTANCE.findService(GRPCChannelManager)!.reportError(error);
}
},
);
// Same issue for the second heartbeat RPC.
this.managementServiceClient.keepAlive(this.keepAlivePkg!, new grpc.Metadata(), options, (error) => {
if (error) {
logHeartbeatError('Failed to send heartbeat', error);
ServiceManager.INSTANCE.findService(GRPCChannelManager)!.reportError(error);
}
});// BUG: trace stream callback can also run after shutdown/registry cleanup.
const stream = this.reporterClient.collect(
new grpc.Metadata(),
{ deadline: Date.now() + config.traceTimeout },
(error) => {
if (error) {
logReportError('Failed to report trace data', error);
ServiceManager.INSTANCE.findService(GRPCChannelManager)!.reportError(error);
}
resolve();
},
);// BUG: meter stream callback and the sync catch block have the same required
// global lookup after async/lifecycle boundaries.
const stream = this.reporterClient.collect(
new grpc.Metadata(),
{ deadline: Date.now() + (config.traceTimeout || 10000) },
(error: grpc.ServiceError | null) => {
if (error) {
logReportError('Failed to report runtime meter data', error);
ServiceManager.INSTANCE.findService(GRPCChannelManager)!.reportError(error);
}
resolve();
},
);I reproduced the issue with fake delayed callbacks by clearing the service registry before the callback fires:
registry cleared
Failed to send heartbeat
uncaught TypeError: Cannot read properties of undefined (reading 'reportError')
and the same pattern reproduces for trace and meter callbacks.
Node-native suggestion: don't resolve agent services through the mutable global registry inside async callbacks. Capture the GRPCChannelManager during prepare(), keep a local closed flag, and make late callbacks no-op after shutdown. In-flight grpc-js calls generally cannot be assumed cancelled just because the agent service shut down.
// Node-native agent style: callbacks can arrive late, so they must be cheap
// and safe after shutdown.
private closed = false;
private channelManager?: GRPCChannelManager;
prepare(): void {
this.channelManager = ServiceManager.INSTANCE.findService(GRPCChannelManager);
this.channelManager?.addChannelListener(this);
}
shutdown(): void {
this.closed = true;
clearInterval(this.heartbeatTimer);
this.heartbeatTimer = undefined;
this.managementServiceClient = undefined;
}
private reportGrpcError(error: unknown): void {
if (this.closed) {
return;
}
this.channelManager?.reportError(error);
}
// Callback uses the safe local helper, not a required global lookup.
(error) => {
if (error) {
logHeartbeatError('Failed to send heartbeat', error);
this.reportGrpcError(error);
}
}The same pattern should be applied in TraceSegmentServiceClient and MeterSender.
High: reportError() can leave services disconnected while grpc-js is still READY
The updated reportError() now sends DISCONNECT, which is useful when the channel actually lost connectivity. But it can also be called for statuses that do not necessarily move the grpc-js channel out of READY (UNKNOWN, PERMISSION_DENIED, UNAUTHENTICATED, etc.). When that happens, listeners clear their clients, but the connectivity watcher may never fire another CONNECTED event because the channel state did not change.
// BUG/RISK: this can notify DISCONNECT even when the grpc-js channel remains
// READY. Since clients are recreated only on CONNECTED watcher events, they can
// stay undefined forever.
reportError(error: unknown): void {
if (!isGrpcNetworkError(error)) {
logger.debug('gRPC report error (ignored): %s', error);
return;
}
logger.debug('gRPC network error, notify DISCONNECT: %s', error);
this.notify(GRPCChannelStatus.DISCONNECT);
}I simulated this state and got:
events CONNECTED,DISCONNECT
manager still sees connected true
Node-native suggestion: use grpc-js connectivity as the source of truth for local connection state, or make reportError() include a re-notification/recheck path. For example, avoid clearing clients if the shared channel is still READY, or schedule an unref'ed connectivity recheck that sends CONNECTED when managedChannel.isConnected(true) is already true.
// Sketch: avoid a permanent local DISCONNECT when the transport is still ready.
reportError(error: unknown): void {
if (!isGrpcNetworkError(error)) {
return;
}
const managed = this.managedChannel;
if (!managed) {
this.notify(GRPCChannelStatus.DISCONNECT);
return;
}
if (managed.isConnected(false)) {
this.notify(GRPCChannelStatus.CONNECTED);
return;
}
this.notify(GRPCChannelStatus.DISCONNECT);
}The exact policy can differ, but the important bit is that reportError() should not create a local disconnected state without a path back to connected when the underlying grpc-js channel never changed state.
Non-blocking cost note: the management client still sends both reportInstanceProperties and keepAlive on every heartbeat. That existed before this update, but for a Node agent it would be cheaper to report static instance properties less frequently and send only keepAlive on most ticks, like the Java agent does.
…Error recovery - Capture GRPCChannelManager in prepare(); reportGrpcError() no-ops after shutdown - Apply pattern in ServiceManagementClient, TraceSegmentServiceClient, MeterSender - reportError(): re-notify CONNECTED when grpc-js channel stays ready - Guard scheduleNextReport, flush, heartbeat, and meter timers with closed flag Signed-off-by: songzhendong <289505773@qq.com>
…Error recovery - Capture GRPCChannelManager in prepare(); reportGrpcError() no-ops after shutdown - Apply pattern in ServiceManagementClient, TraceSegmentServiceClient, MeterSender - reportError(): re-notify CONNECTED when grpc-js channel stays ready - Guard scheduleNextReport, flush, heartbeat, and meter timers with closed flag - Trace stream: always end in finally; EventEmitter off-before-on and closed guard - Clear GRPCChannelManager listeners on shutdown; replace channelManager! with guards Signed-off-by: songzhendong <289505773@qq.com>
…Error recovery - Capture GRPCChannelManager in prepare(); reportGrpcError() no-ops after shutdown - Apply pattern in ServiceManagementClient, TraceSegmentServiceClient, MeterSender - reportError(): re-notify CONNECTED when grpc-js channel stays ready - Guard scheduleNextReport, flush, heartbeat, and meter timers with closed flag - Trace stream: always end in finally; EventEmitter off-before-on and closed guard - Clear GRPCChannelManager listeners on shutdown; replace channelManager! with guards Signed-off-by: songzhendong <289505773@qq.com>
…Error recovery - Capture GRPCChannelManager in prepare(); reportGrpcError() no-ops after shutdown - Apply pattern in ServiceManagementClient, TraceSegmentServiceClient, MeterSender - reportError(): re-notify CONNECTED when grpc-js channel stays ready - Guard scheduleNextReport, flush, heartbeat, and meter timers with closed flag - Trace stream: always end in finally; EventEmitter off-before-on and closed guard - Clear GRPCChannelManager listeners on shutdown; replace channelManager! with guards Signed-off-by: songzhendong <289505773@qq.com>
…Error recovery - Capture GRPCChannelManager in prepare(); reportGrpcError() no-ops after shutdown - Apply pattern in ServiceManagementClient, TraceSegmentServiceClient, MeterSender - reportError(): re-notify CONNECTED when grpc-js channel stays ready - Guard scheduleNextReport, flush, heartbeat, and meter timers with closed flag - Trace stream: always end in finally; EventEmitter off-before-on and closed guard - Clear GRPCChannelManager listeners on shutdown; replace channelManager! with guards Signed-off-by: songzhendong <289505773@qq.com>
…Error recovery - Capture GRPCChannelManager in prepare(); reportGrpcError() no-ops after shutdown - Apply pattern in ServiceManagementClient, TraceSegmentServiceClient, MeterSender - reportError(): re-notify CONNECTED when grpc-js channel stays ready - Guard scheduleNextReport, flush, heartbeat, and meter timers with closed flag - Trace stream: always end in finally; EventEmitter off-before-on and closed guard - Clear GRPCChannelManager listeners on shutdown; replace channelManager! with guards Signed-off-by: songzhendong <289505773@qq.com>
|
Thank you for taking the time to review the code. Round-2 (blocking / high)
Same commit (self-review hardening + alignment)
|
|
Thanks for the update. I rechecked the previous lifecycle issues against The prior blockers are fixed in my local probes:
I found one remaining compatibility/cost-control bug in config handling. Deprecated programmatic runtime metric options are declared but ignoredThe new config type declares deprecated programmatic aliases such as // index.ts
// BUG: options are merged into the already-defaulted singleton config.
// runtimeMetricsReporterActive is already true by default at this point.
Object.assign(config, options);
finalizeConfig(config);// AgentConfig.ts
// BUG: this only copies the deprecated alias when the canonical field is
// undefined. In the real singleton config it is already defaulted, so
// nvmMetricsReporterActive: false is ignored.
function applyDeprecatedRuntimeMetricConfig(config: AgentConfig): void {
if (config.runtimeMetricsReporterActive === undefined) {
if (config.nvmMetricsReporterActive !== undefined) {
config.runtimeMetricsReporterActive = config.nvmMetricsReporterActive;
} else if (config.nvmJvmReporterActive !== undefined) {
config.runtimeMetricsReporterActive = config.nvmJvmReporterActive;
}
}
}I verified this locally: Env aliases are handled earlier in the default config construction, so this is specifically the programmatic Suggested Node-side fix: normalize deprecated aliases on the incoming // Sketch: normalize user-provided options before Object.assign(config, options).
function normalizeDeprecatedRuntimeMetricOptions(options: AgentConfig): AgentConfig {
if (options.runtimeMetricsReporterActive === undefined) {
options.runtimeMetricsReporterActive =
options.nvmMetricsReporterActive ?? options.nvmJvmReporterActive;
}
if (options.runtimeMetricsCollectPeriod === undefined) {
options.runtimeMetricsCollectPeriod =
options.nvmMetricsCollectPeriod ?? options.nvmJvmMetricsCollectPeriod;
}
if (options.runtimeMetricsReportPeriod === undefined) {
options.runtimeMetricsReportPeriod =
options.nvmMetricsReportPeriod ?? options.nvmJvmMetricsReportPeriod;
}
if (options.runtimeMetricsBufferSize === undefined) {
options.runtimeMetricsBufferSize =
options.nvmMetricsBufferSize ?? options.nvmJvmMetricsBufferSize;
}
return options;
}Please also add a small config test for this path: // Expected behavior: deprecated programmatic disable should still disable
// MeterSender registration unless the canonical option is explicitly set.
agent.start({ nvmMetricsReporterActive: false });
expect(config.runtimeMetricsReporterActive).toBe(false); |
05851b5 to
d67e866
Compare
Normalize programmatic deprecated aliases before merging user options into the singleton config without overwriting defaults with undefined, apply deprecated fields during finalizeConfig for direct config mutation, and add unit plus integration coverage for alias mapping, MeterSender registration, and real ServiceManager loadServices. Run config tests in CI. Signed-off-by: songzhendong <289505773@qq.com>
Normalize programmatic deprecated aliases before merging user options into the singleton config without overwriting defaults with undefined, apply deprecated fields during finalizeConfig for direct config mutation, and add unit plus integration coverage for alias mapping and MeterSender registration through real ServiceManager loadServices. Signed-off-by: songzhendong <289505773@qq.com>
Normalize programmatic deprecated aliases before merging user options into the singleton config without overwriting defaults with undefined, apply deprecated fields during finalizeConfig for direct config mutation, and add unit plus integration coverage for alias mapping and MeterSender registration through real ServiceManager loadServices. Fork-only: run config tests in TestLib via npx jest tests/config/. Signed-off-by: songzhendong <289505773@qq.com>
Normalize programmatic deprecated aliases before merging user options into the singleton config without overwriting defaults with undefined, apply deprecated fields during finalizeConfig for direct config mutation, and add unit tests for alias mapping and agent.start() disable path. Signed-off-by: songzhendong <289505773@qq.com>
Normalize programmatic deprecated aliases before merging user options into the singleton config without overwriting defaults with undefined, apply deprecated fields during finalizeConfig for direct config mutation, and add unit tests for alias mapping and agent.start() disable path. Fork-only: run config tests in TestLib via npx jest tests/config/. Signed-off-by: songzhendong <289505773@qq.com>
Normalize programmatic deprecated aliases before merging user options into the singleton config without overwriting defaults with undefined, apply deprecated fields during finalizeConfig for direct config mutation, and add unit tests for alias mapping and agent.start() disable path. Signed-off-by: songzhendong <289505773@qq.com>
|
Thanks for the Round-3 review Fix
TestAdded agent.start({ nvmMetricsReporterActive: false });
expect(config.runtimeMetricsReporterActive).toBe(false);
Also covers canonical-over-deprecated priority and programmatic nvmJvm* aliases.
Verification
Local:
npm run build
npx jest tests/config/AgentConfig.test.ts --runInBand
Fork CI (Build / License / Test) is green |
|
Thanks for the update. The previous single-start alias issue is fixed: I found one remaining stale-config edge around restart. // First start: deprecated alias is promoted, but nvmMetricsReporterActive
// remains on the singleton config.
agent.start({
nvmMetricsReporterActive: false,
});
agent.destroy();
// BUG: this canonical option should win on the second start, but the stale
// config.nvmMetricsReporterActive=false from the first start is still present.
agent.start({
runtimeMetricsReporterActive: true,
});I verified locally: The cause is this combination: // normalizeDeprecatedRuntimeMetricOptions() promotes the alias...
if (reporterActive !== undefined) {
normalized.runtimeMetricsReporterActive = reporterActive;
}
// ...but keeps normalized.nvmMetricsReporterActive on the object.
Object.assign(config, normalizeDeprecatedRuntimeMetricOptions(options));
// Later finalizeConfig() sees the stale alias and overwrites the canonical value.
if (config.nvmMetricsReporterActive !== undefined) {
config.runtimeMetricsReporterActive = config.nvmMetricsReporterActive;
}Suggestion: after promoting deprecated aliases, delete the deprecated fields from the normalized options and from the singleton config after agent.start({ nvmMetricsReporterActive: false });
agent.destroy();
agent.start({ runtimeMetricsReporterActive: true });
expect(config.runtimeMetricsReporterActive).toBe(true); |
…ic aliases. Strip deprecated runtime metric fields after alias promotion in normalizeDeprecatedRuntimeMetricOptions(), respect explicit canonical options on restart in applyDeprecatedRuntimeMetricConfig(), and add regression tests for destroy/start cycles. Signed-off-by: songzhendong <289505773@qq.com>
|
Thanks for the Round-4 review. Fix
TestAdded a regression test in agent.start({ nvmMetricsReporterActive: false });
agent.destroy();
agent.start({ runtimeMetricsReporterActive: true });
expect(config.runtimeMetricsReporterActive).toBe(true);
Also verifies normalized options do not retain stale alias keys.
Verification
Local:
npm run build
npx jest tests/config/AgentConfig.test.ts --runInBand
Fork CI (Build / License / Test) is green |
There was a problem hiding this comment.
Pull request overview
This PR refactors the Node.js agent’s remote reporting stack to a Java agent.core-aligned gRPC v1 architecture (single shared channel + unified BootService lifecycle), and adds periodic Node.js runtime meters reported via MeterReportService.
Changes:
- Replace the old
GrpcProtocol/ per-client channel approach withServiceManager+GRPCChannelManager+ Java-aligned remote clients. - Add six
instance_nodejs_*runtime meters with configurable collect/report periods and buffering, including deprecated-config/env alias normalization. - Extend tests/docs to cover runtime meter mapping and E2E
meterItemsassertions.
Reviewed changes
Copilot reviewed 28 out of 28 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| README.md | Documents Node.js runtime meters and new env vars (with legacy aliases). |
| src/agent/core/boot/BootService.ts | Replaces the old protocol contract with a Java-style BootService lifecycle interface. |
| src/agent/core/boot/ServiceManager.ts | Adds a central boot/flush/shutdown orchestrator for core services. |
| src/agent/core/meter/MeterSender.ts | Adds runtime meter buffering and gRPC reporting via MeterReportService. |
| src/agent/core/meter/RuntimeMetricsCollector.ts | Maps sampled runtime data into six instance_nodejs_* meters. |
| src/agent/core/meter/RuntimeSampler.ts | Samples CPU + memory runtime snapshot from Node/V8 APIs. |
| src/agent/core/remote/AgentIDDecorator.ts | Adds Agent-Version metadata via a channel interceptor. |
| src/agent/core/remote/AuthenticationDecorator.ts | Adds Authentication metadata via a channel interceptor. |
| src/agent/core/remote/ChannelBuilder.ts | Introduces a builder-chain context for channel credentials/options. |
| src/agent/core/remote/ChannelBuilder.ts | Introduces a builder-chain context for channel credentials/options. |
| src/agent/core/remote/ChannelDecorator.ts | Replaces old metadata helper with a channel decorator (interceptor) interface. |
| src/agent/core/remote/GRPCChannel.ts | Implements a shared channel with builder/decorator chaining and client override options. |
| src/agent/core/remote/GRPCChannelListener.ts | Adds a listener contract for connectivity status changes. |
| src/agent/core/remote/GRPCChannelManager.ts | Adds the shared channel manager and connectivity watching/notification. |
| src/agent/core/remote/GRPCChannelStatus.ts | Replaces the old Client interface with a simple CONNECTED/DISCONNECT status enum. |
| src/agent/core/remote/ServiceManagementClient.ts | Refactors heartbeat/instance properties into a BootService driven by channel status. |
| src/agent/core/remote/StandardChannelBuilder.ts | Adds standard (plaintext) channel options, including message size limits. |
| src/agent/core/remote/TLSChannelBuilder.ts | Adds TLS credential upgrade when SW_AGENT_SECURE=true. |
| src/agent/core/remote/TraceSegmentServiceClient.ts | Refactors trace segment streaming into a BootService driven by channel status. |
| src/agent/protocol/grpc/GrpcProtocol.ts | Removes the old protocol wrapper (replaced by ServiceManager-based bootstrap). |
| src/agent/protocol/grpc/SegmentObjectAdapter.ts | Removes the adapter in favor of Segment.transform(). |
| src/agent/protocol/grpc/clients/HeartbeatClient.ts | Removes the old heartbeat client (replaced by ServiceManagementClient). |
| src/agent/protocol/grpc/clients/TraceReportClient.ts | Removes the old trace client (replaced by TraceSegmentServiceClient). |
| src/config/AgentConfig.ts | Adds runtime-metrics config/env parsing + deprecated alias normalization. |
| src/index.ts | Switches agent startup/shutdown/flush from GrpcProtocol to ServiceManager; normalizes deprecated runtime-metrics options. |
| src/trace/context/Segment.ts | Adds Segment.transform() to serialize segments for gRPC reporting. |
| tests/config/AgentConfig.test.ts | Adds unit tests for deprecated runtime-metrics alias normalization behavior. |
| tests/plugins/express/expected.data.yaml | Adds E2E validation expectations for runtime meterItems (server/client). |
| tests/runtime/RuntimeMetricsCollector.test.ts | Adds unit test for meter name mapping and non-negative values. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| const channel = managed.getChannel(); | ||
| const currentState = channel.getConnectivityState(true); | ||
|
|
||
| channel.watchConnectivityState(currentState, Infinity, (error) => { |
| return new TraceSegmentReportServiceClient( | ||
| config.collectorAddress, | ||
| grpc.credentials.createInsecure(), | ||
| this.channelManager.getClientOptions(), | ||
| ); |
| return new ManagementServiceClient( | ||
| config.collectorAddress, | ||
| grpc.credentials.createInsecure(), | ||
| this.channelManager.getClientOptions(), | ||
| ); |
| return new MeterReportServiceClient( | ||
| config.collectorAddress, | ||
| grpc.credentials.createInsecure(), | ||
| this.channelManager.getClientOptions(), | ||
| ); |
| let metadataWritten = false; | ||
| const timestamp = Date.now(); | ||
| for (const snapshot of snapshots) { | ||
| for (const meterData of this.collector.toMeterData(snapshot)) { | ||
| if (!metadataWritten) { | ||
| meterData | ||
| .setService(config.serviceName) | ||
| .setServiceinstance(config.serviceInstance) | ||
| .setTimestamp(timestamp); | ||
| metadataWritten = true; | ||
| } | ||
| stream.write(meterData); | ||
| } | ||
| } |
When the shared channel is already READY before the first state transition, listeners now receive CONNECTED immediately instead of staying DISCONNECT until a later connectivity change. Add unit regression test for the already-READY boot path.
Fixed in this updateGRPCChannelManager boot-time connectivity: notify listeners of the current channel state in boot() (via notifyCurrentConnectivityState()), so CONNECTED is delivered when the shared channel is already READY at startup. Planned for a follow-up (later)Multi-address / stub target consistency: align client stub targets with GRPCChannelManager.resolveAddress() together with V2 work (multi-backend failover is out of V1 scope; DNS re-resolve is the higher-priority). No change planned for now
TestAdded manager.addChannelListener(listener);
manager.boot();
expect(listener.statusChanged).toHaveBeenCalledWith(GRPCChannelStatus.CONNECTED);
npm run build
npx jest tests/remote/GRPCChannelManager.test.ts --runInBand |
Update e2e Node.js agent pin after merge of apache/skywalking-nodejs#139 (Refactor gRPC runtime services and add Node.js runtime metrics).
Add nodejs-runtime MAL rules, documentation, e2e assertions, and pin SW_AGENT_NODEJS_COMMIT to apache/skywalking-nodejs master @ 36df516 (merged apache/skywalking-nodejs#139).
Add nodejs-runtime MAL rules, documentation, e2e assertions, and pin SW_AGENT_NODEJS_COMMIT to apache/skywalking-nodejs master @ 36df516 (merged apache/skywalking-nodejs#139).
Add nodejs-runtime MAL rules, documentation, e2e assertions, and pin SW_AGENT_NODEJS_COMMIT to apache/skywalking-nodejs master @ 36df516 (merged apache/skywalking-nodejs#139).
Add nodejs-runtime MAL rules, documentation, e2e assertions, and pin SW_AGENT_NODEJS_COMMIT to apache/skywalking-nodejs master @ 36df516 (merged apache/skywalking-nodejs#139).
Add nodejs-runtime MAL rules, documentation, e2e assertions, and pin SW_AGENT_NODEJS_COMMIT to apache/skywalking-nodejs master @ 36df516 (merged apache/skywalking-nodejs#139).
Add nodejs-runtime MAL rules, documentation, e2e assertions, and pin SW_AGENT_NODEJS_COMMIT to apache/skywalking-nodejs master @ 36df516 (merged apache/skywalking-nodejs#139).
CI status (fork validation, commit
481b290)All checks passed on fork before upstream submission (2026-06-26, 34/34):
Key job (runtime meter E2E):
meterItemsasserts server + client × 6 meters each: https://github.com/songzhendong/skywalking-nodejs/actions/runs/28235029569/job/83647759105Fork PR checks: https://github.com/songzhendong/skywalking-nodejs/pull/2/checks
OAP dependency (not in this PR)
Runtime meters are reported via
MeterReportService, but OAP consumption requires a separate upstream PR adding thenodejs-runtimeanalyzer (nodejs-runtime.yaml, 6 MAL rules). Agent-side reporting and plugin E2EmeterItemsare covered here.Summary
This PR does two things:
agent.core.*: single Channel, unifiedBootServicelifecycle, Builder/Decorator chain; Trace / Heartbeat / Meter share one code path for cross-language alignment.instance_nodejs_*process meters viaMeterReportService(1s collect / 1s report).Not in this PR: release scripts, version bump,
CONTRIBUTING/ How-to-release, httpbin test infra, OAP analyzer.Motivation
Before refactor, Node used a standalone
GrpcProtocolwith a different structure from Javaagent.core. This PR converges remote reporting onto the same architecture as Java: one shared gRPC channel, unified lifecycle and extension points, lower cross-language maintenance cost, and a path consistent with Java for failover, TLS, Event/Log, etc.GrpcProtocol+ two Channels (Trace / Heartbeat each own connection)GRPCChannelManager(same as Java / Go)protocol/grpc/clients/*naming diverges from Javacore/remote/*class names and roles map 1:1 to JavaSegmentObjectAdapterSegment.transform()— same pattern as JavaTraceSegment.transform()prepare → boot → shutdownlifecyclev2 out of scope: multi-address failover, full TLS CA/mTLS, Event/Log clients.
Architecture overview
Before (Node old) — dual Channel, no central Channel manager
After (Node v1) — single Channel, Java-aligned
Java — single Channel + SPI registration
flowchart TB subgraph nodeBefore["Node before refactor"] GP[GrpcProtocol] --> HB[HeartbeatClient Ch1] GP --> TR[TraceReportClient Ch2] end subgraph nodeV1["Node v1"] SM[ServiceManager] --> CM[GRPCChannelManager x1] CM --> TC[TraceSegmentServiceClient] CM --> MC[ServiceManagementClient] CM --> MS[MeterSender] end subgraph javaAgent["Java agent"] JSM[ServiceManager SPI] --> JCM[GRPCChannelManager x1] JCM --> JTC[TraceSegmentServiceClient] JCM --> JMC[ServiceManagementClient] JCM --> JMS[MeterSender] JCM --> JEV[EventReportServiceClient] JCM --> JLG[LogReportServiceClient] end nodeBefore ~~~ nodeV1 ~~~ javaAgentAlignment with Java Agent
agent/core/{boot,remote,meter}/agent.core.{boot,remote,meter}/ServiceManager.INSTANCE.boot()ServiceManager.INSTANCE.boot()BootService(prepare/boot/shutdown/priority)ServiceLoader+ SPISegment.transform()TraceSegment.transform()ServiceManager.flush/shutdownServiceManagerunified dispatchGRPCChannelManagerGRPCChannelListenerCONNECTED/DISCONNECTreportError()stub (debug log)reportError → reconnectGRPC_CHANNEL_CHECK_INTERVALFORCE_RECONNECTION_PERIOD, etc.SW_AGENT_SECURETLSChannelBuilder→createSsl()TLSChannelBuilderchainagent/ca/+ cert path configsimple/ssl,simple/mtlsTraceSegmentServiceClientTraceSegmentServiceClientServiceManagementClient(20s)ServiceManagementClientAuthenticationDecorator+AgentIDDecoratorEventReportServiceClient/LogReportServiceClientGRPCStreamServiceStatusported, not wiredMeterSender(BootService + shared Channel)MeterSender(BootService + shared Channel)MeterReportService(generic Meter protobuf)JVMMetricReportService(JVM protobuf)instance_nodejs_*(CPU + memory)MeterSingleValuegaugeMeterFactory/MeterService(≤500)SW_AGENT_NODEJS_RUNTIME_METRICS_BUFFER_SIZE= 600Config.Jvm.BUFFER_SIZE= 600timestampper stream batchJVMMetrichas its own timeofferfull → drop oldest; send failure not re-queuedprocess.cpuUsage()÷ logical coresnodejs-runtime.yaml(6 MAL rules)RuntimeMetricsCollector.test.tsmeterItems(server + client, 6 meters each)Design note: Node runtime metrics use
MeterReportService(same as Go/Python), not JavaJVMMetricReportService.Six meters:
instance_nodejs_process_cpu,instance_nodejs_heap_used,instance_nodejs_heap_total,instance_nodejs_heap_limit,instance_nodejs_rss,instance_nodejs_external_memoryFile / class mapping (before → v1 → Java)
GrpcProtocol.tsServiceManager.tsServiceManager.javaProtocol.tsBootService.tsBootService.javaGRPCChannelManager.tsGRPCChannelManager.javaGRPCChannel.tsGRPCChannel.javaStandardChannelBuilder.tsStandardChannelBuilder.javaTLSChannelBuilder.tsTLSChannelBuilder.javaAuthInterceptor.tsAuthenticationDecorator.tsAuthenticationDecorator.javaAgentIDDecorator.tsAgentIDDecorator.javaClient.tsGRPCChannelStatus.ts+ ListenerGRPCChannelStatus.java+ ListenerTraceReportClient.tsTraceSegmentServiceClient.tsTraceSegmentServiceClient.javaHeartbeatClient.tsServiceManagementClient.tsServiceManagementClient.javaSegmentObjectAdapter.tsSegment.transform()TraceSegment.transform()meter/{RuntimeSampler,RuntimeMetricsCollector,MeterSender}.tsmeter/{MeterSender,MeterService,...}.javaEventReportServiceClient.javaLogReportServiceClient.javagRPC v1 implemented
Boot lifecycle
src/agent/core/boot/BootService.tsprepare()→boot()→onComplete()→shutdown(),priority()orderingsrc/agent/core/boot/ServiceManager.tsBootService; unifiedboot()/shutdown()/flush()src/index.tsGrpcProtocol().heartbeat().report()withServiceManagerServices registered by
ServiceManager(hard-coded, no Java SPI):GRPCChannelManagerTraceSegmentServiceClientServiceManagementClientMeterSenderruntimeMetricsReporterActive=truegRPC Channel stack (
remotepackage)GRPCChannelManager.tsGRPCChannelManager.javawatchConnectivityStatenotifies listeners;reportError()debug log onlyGRPCChannel.tsGRPCChannel.javachannelOverride/ interceptorsChannelBuilder.tsChannelBuilder.javaStandardChannelBuilder.tsStandardChannelBuilder.javaTLSChannelBuilder.tsTLSChannelBuilder.javaSW_AGENT_SECURE=true→grpc.credentials.createSsl()ChannelDecorator.tsChannelDecorator.javaAgentIDDecorator.tsAgentIDDecorator.javaAuthenticationDecorator.tsAuthenticationDecorator.javaGRPCChannelListener.tsGRPCChannelListener.javaGRPCChannelStatus.tsClient.tsenum)Remote clients (listener-driven stubs)
TraceSegmentServiceClient.tsTraceReportClientTraceSegmentReportServicereportError()on failureServiceManagementClient.tsHeartbeatClientManagementServiceSegment serialization: new
Segment.transform(): SegmentObjectinsrc/trace/context/Segment.ts; removedSegmentObjectAdapter.ts(same pattern as JavaTraceSegment.transform()).Runtime meter (shared Channel)
RuntimeSampler.tsmemoryUsage+v8.getHeapStatistics()+cpuUsageRuntimeMetricsCollector.tsMeterSender.tsBootService+GRPCChannelListener; 1s collect / 1s report (configurable);MeterReportService.collectstreaminginstance_nodejs_process_cpuinstance_nodejs_heap_usedinstance_nodejs_heap_totalinstance_nodejs_heap_limitinstance_nodejs_rssinstance_nodejs_external_memoryConfiguration
New environment variables:
SW_AGENT_NODEJS_RUNTIME_METRICS_REPORTER_ACTIVEtrue(explicitfalsedisables)SW_AGENT_NODEJS_RUNTIME_METRICS_COLLECT_PERIOD1000msSW_AGENT_NODEJS_RUNTIME_METRICS_REPORT_PERIOD1000ms (1s)SW_AGENT_NODEJS_RUNTIME_METRICS_BUFFER_SIZE600Deprecated aliases (still readable):
SW_AGENT_RUNTIME_METRICS_*,SW_AGENT_NVM_METRICS_*,SW_AGENT_NVM_JVM_*.Existing gRPC settings (semantics unchanged):
SW_AGENT_COLLECTOR_BACKEND_SERVICES— v1 uses first comma-separated address onlySW_AGENT_SECURE— enables TLS ChannelBuilderSW_AGENT_AUTHENTICATION— OAP authenticationRemoved / replaced Node-specific abstractions
src/agent/protocol/grpc/GrpcProtocol.tsServiceManager+ clientssrc/agent/protocol/grpc/SegmentObjectAdapter.tsSegment.transform()src/agent/protocol/grpc/clients/*src/agent/core/remote/*(renamed + refactored)src/agent/protocol/Protocol.tssrc/agent/core/boot/BootService.tsEntire
src/agent/protocol/directory removed.Tests and docs
tests/runtime/RuntimeMetricsCollector.test.ts— 6 meter names and non-negative valuestests/plugins/express/expected.data.yaml— only addedmeterItems(one block with server + client entries, 6 meters each); httpbin / docker-compose unchangedTLS v1 scope
SW_AGENT_SECURE=true→TLSChannelBuilder→grpc.credentials.createSsl()(system default trust store)collector:11800); self-signed CA / mTLS need v2SW_AGENT_CA_PATH(NODE_EXTRA_CA_CERTSineffective for@grpc/grpc-js)Manual smoke test (2026-06-26, maintainer-local — not in CI):
:11811SW_CORE_GRPC_SSL_ENABLED=true, certs fromtls_key_generate.sh(SERVER_CN=localhost)SW_AGENT_SECURE=true,SW_AGENT_COLLECTOR_BACKEND_SERVICES=localhost:11811openssl s_client -connect 127.0.0.1:11811 -servername localhost, ALPNh2getAllServicesgetServiceInstances(Agent running ≥25s, UTC window)Procedure: start TLS OAP → Agent (
secure: true) → local HTTP request →agent.flush()→ GraphQL check services and instances. Self-signed CA required loading root CA intocreateSsl(rootCerts)(temporary patch, not in this PR); public/enterprise CA works withSW_AGENT_SECURE=truealone. TLS verification should be covered by CI E2E in a follow-up (v2).Out of scope (v2)
Channel failover and reconnect
Per
GRPCChannelManager.tscomments: multi-address backend rotation,reportError()-driven failover, periodic Channel check, forced reconnect, periodic DNS re-resolve.TLS / mTLS
Relative to Java
TLSChannelBuilder: custom CA path, mTLS client certs,SW_AGENT_CA_PATH(or equivalent), self-signed TLS E2E CI.Note:
NODE_EXTRA_CA_CERTSis ineffective for@grpc/grpc-js; must pass CA explicitly tocreateSsl(ca).Remote clients not ported
EventReportServiceClient(proto exists, no agent client)LogReportServiceClient(proto exists, no agent client)Skeleton / unwired code
GRPCStreamServiceStatus.ts: ported from Java, not referenced by Trace/Meter clients yetServiceLoaderSPI: Node hard-codes 4 services, no plugin discoveryTest and API gaps
instance_nodejs_active_handlesnot in the current 6-meter setRuntime requirements