Add fork safety and cooperative cancellation to native trace exporter #5835
lloeki wants to merge 18 commits into
🎉 All green! 🧪 All tests passed 🔗 Commit SHA: 5a66cab
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fa6e0db5c2
  rb_thread_call_without_gvl2(
      send_chunks_without_gvl, &args,
-     RUBY_UBF_IO, NULL);
+     interrupt_exporter_call, &cancel_token);
Propagate interrupts after cooperative cancellation
When Thread#kill or shutdown interrupts an in-flight send, the new UBF cancels the token and can make the Rust call return with args.send_ran == true; this loop then skips check_if_pending_exception() and falls through to create a transport error response instead of reliably raising the pending interrupt. In that scenario a writer thread that was killed during a native send may continue running after cancellation, so the pending exception should be checked after the GVL call once native response cleanup is safe, not only when the send never started.
Such a tiny window here for this to happen in send_chunks_without_gvl:
args->send_ran = true;
return NULL;
(give or take other internals in rb_thread_call_without_gvl2)
But I guess that's an accurate concern.
Strech
left a comment
LGTM, just a question on reporting it
Core::Utils::AtForkMonkeyPatch.at_fork(:child) do
  exporter._native_after_fork_in_child
rescue => e
  Datadog.logger.warn { "Native transport after-fork reset failed: #{e}" }
Is this just a warning, or does it mean no traces for the forked child?
Looking at the implementation of _native_after_fork_in_child, this looks like just a defensive rescue for uncommon failures like "TraceExporter has not been initialized or was already freed", which are errors that are likely "impossible" in a well-coded implementation (that is, exceptions caught here should happen during code changes in development, not in production). I think this error message is really just for ourselves, since we don't expect this to fail.
But I agree that we should log what would happen if it failed, something like: "Native transport after-fork reset failed. Traces might not be sent to Datadog: "
Exactly, I think it makes sense at least to explain the consequences of that error in the warning.
vpellan
left a comment
Wondering the same thing as Sergey; otherwise LGTM
marcotc
left a comment
Only #5835 (comment) needs to be acked/addressed.
This is a blocker: all hooks need to be called. Otherwise the child may inherit a locked mutex, causing deadlocks when dropping resources. In addition, some system resources need to be dropped before the fork, or they may cause panics when dropped in the child (kqueue handle on macOS). Another requirement of libdatadog functions is that the Rust functions are not interrupted by a fork. In dd-trace-py this is handled by joining on all threads that run Rust functions releasing the GIL (e.g.
VianneyRuhlmann
left a comment
Looking good for libdatadog usage
Expose `_native_before_fork`, `_native_after_fork_in_parent`, and `_native_after_fork_in_child` instance methods that delegate to libdatadog's SharedRuntime fork hooks. These coordinate the tokio runtime lifecycle around process forks (Puma, Unicorn, Passenger).
Create a cancellation token per send call and pass it to the custom unblock function. When Ruby interrupts the thread (shutdown, Thread#kill), the UBF cancels the token, which cooperatively aborts the in-flight HTTP request in the Rust runtime. This replaces the signal-based RUBY_UBF_IO which could not actually cancel the Rust HTTP pipeline.
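The cancellation flow described above can be modelled in plain Ruby. This is only an analogy: the actual change lives in the C extension and libdatadog, and the `CancelToken` class and `send_with_cancellation` method below are hypothetical names invented for illustration. The sketch shows one token per send, a worker that checks the token between steps, and an interrupter that cancels it.

```ruby
# Hypothetical pure-Ruby model of a per-send cancellation token.
class CancelToken
  def initialize
    @mutex = Mutex.new
    @cancelled = false
  end

  # Called by the interrupter (the role the unblock function plays natively).
  def cancel
    @mutex.synchronize { @cancelled = true }
  end

  def cancelled?
    @mutex.synchronize { @cancelled }
  end
end

# Simulates an in-flight request that polls the token between steps.
def send_with_cancellation(token, steps: 50, step_delay: 0.01)
  steps.times do
    return :cancelled if token.cancelled?

    sleep(step_delay)
  end
  :sent
end

token = CancelToken.new # one token per send call
worker = Thread.new { send_with_cancellation(token) }
token.cancel            # what the unblock function would do on interrupt
result = worker.value
```

The point of the per-send token is that cancelling one send cannot affect a later one, since each call starts with a fresh, uncancelled token.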
Register a `:child` callback that calls `_native_after_fork_in_child` on the exporter to recreate the tokio runtime in forked child processes. Without this, the Rust runtime is dead after fork and subsequent send calls would hang or fail. The `AtForkMonkeyPatch` only supports `:child` stage, so `before_fork` and `after_fork_in_parent` are not called. The child path is the critical one: it creates a fresh runtime regardless of whether the parent was prepared.
The trace-exporter FFI was redesigned in libdatadog:
- The cancellation token is now an opaque heap object instead of a stack-allocated Handle struct. Obtain it with ddog_trace_exporter_cancel_token_new, pass the pointer to ddog_trace_exporter_send_trace_chunks and the unblock function, and release it with ddog_trace_exporter_cancel_token_drop.
- The dedicated ddog_trace_exporter_before_fork/_after_fork_in_parent/_after_fork_in_child hooks were removed. Fork safety now goes through the generic shared-runtime FFI.
The exporter wrapper owns a SharedRuntime created with ddog_shared_runtime_new and attached to the config via ddog_trace_exporter_config_set_shared_runtime before the exporter is built. The fork hooks drive ddog_shared_runtime_before_fork/_after_fork_parent/_after_fork_child on the stored runtime, and the runtime is freed alongside the exporter in dfree.
AtForkMonkeyPatch previously supported only the :child stage, which runs after a fork in the child process. The native trace exporter owns a long-lived tokio runtime with background worker threads that must be quiesced before a fork and restored afterwards in both the surviving parent and the child, mirroring libdatadog's before_fork / after_fork_parent / after_fork_child lifecycle.
Additively introduce :before (pre-fork, parent) and :parent (post-fork, parent) stages alongside the existing :child stage:
- Add AT_FORK_BEFORE_BLOCKS and AT_FORK_PARENT_BLOCKS.
- at_fork/run_at_fork_blocks accept :before, :parent and :child, and still raise ArgumentError for any other stage.
- ProcessMonkeyPatch#_fork runs :before before fork, then :parent or :child depending on the result.
- KernelMonkeyPatch#fork runs :before before fork, then :parent in the parent branch and :child in the child branch.
- ProcessMonkeyPatch#daemon runs :before then :child; daemon kills the parent so :parent is intentionally skipped.
Wire the native transport to register :before/:parent/:child blocks for the exporter's fork-safety hooks. Note that on Ruby 3.1+ the :before block runs before every _fork, including system/popen subprocess spawns, so it is kept cheap. Profiling's child-only flow is unchanged.
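A minimal sketch of the three-stage registry described above, assuming a simplified stand-in module. `ForkHooks` and `fork_with_hooks` are hypothetical names; the real AtForkMonkeyPatch patches `Process._fork`, `Kernel#fork` and `Process.daemon` rather than exposing a wrapper like this.

```ruby
# Hypothetical, simplified stand-in for AtForkMonkeyPatch's staged registry.
module ForkHooks
  STAGES = %i[before parent child].freeze
  BLOCKS = STAGES.to_h { |stage| [stage, []] }

  def self.at_fork(stage, &block)
    raise ArgumentError, "Unsupported stage #{stage}" unless STAGES.include?(stage)

    BLOCKS[stage] << block
    nil
  end

  def self.run_at_fork_blocks(stage)
    raise ArgumentError, "Unsupported stage #{stage}" unless STAGES.include?(stage)

    BLOCKS[stage].each(&:call)
  end

  # Runs :before in the parent, forks, then runs :parent or :child
  # depending on which side of the fork we are on.
  def self.fork_with_hooks
    run_at_fork_blocks(:before)
    pid = Process.fork # nil in the child, the child's pid in the parent
    run_at_fork_blocks(pid ? :parent : :child)
    pid
  end
end

calls = []
ForkHooks.at_fork(:before) { calls << :before }
ForkHooks.at_fork(:parent) { calls << :parent }
ForkHooks.at_fork(:child)  { calls << :child }
```

Because the :before blocks run ahead of every fork in this model, anything registered there should stay cheap, which matches the note above about subprocess spawns on Ruby 3.1+.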
In the native trace exporter send loop, a Thread#kill / shutdown fires the unblock function, which cancels the send token. That cancellation can cause rb_thread_call_without_gvl2 to return with send_ran == true (a cancelled, failed send). Because check_if_pending_exception() was only called when !send_ran, in that race the loop exited with the interrupt still pending, and the code fell through to build a transport error response, swallowing the interrupt. After the GVL loop, and after the response has been extracted/freed and chunks handed off to the ensure handler, check for a pending exception unconditionally and rb_jump_tag it if present, so the interrupt propagates instead of being reported as an ordinary error response. Response/chunk free ordering is preserved, so nothing leaks.
State the consequence in the after-fork warning so the message is actionable: a failed reset means traces may not be sent to Datadog. Apply the same consequence-stating style to the before-fork and after-fork-in-parent warnings.
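As a sketch of the consequence-stating style, the child-side reset could be wrapped as below. The message wording follows the suggestion from the review thread and is an assumption about the final text; `RecordingLogger` and `reset_after_fork_in_child` are hypothetical helpers, with the logger standing in for `Datadog.logger`.

```ruby
# Hypothetical logger that records warnings; stands in for Datadog.logger.
class RecordingLogger
  attr_reader :warnings

  def initialize
    @warnings = []
  end

  # Accepts a block, like the lazy form of Logger#warn.
  def warn
    @warnings << yield
  end
end

# Runs the child-side reset and states the consequence if it fails.
def reset_after_fork_in_child(exporter, logger)
  exporter._native_after_fork_in_child
  true
rescue => e
  logger.warn do
    "Native transport after-fork reset failed. Traces might not be sent to Datadog: #{e}"
  end
  false
end
```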
A libdatadog Rust send releases the GVL during ddog_trace_exporter_send_trace_chunks, and the :before fork hook tears down and replaces the native runtime. If a thread forks while another is mid-send, the child inherits a half-completed send and Rust-internal locks, deadlocking or crashing. The existing :before hook only quiesces the runtime's own worker threads; it does not drain a Ruby thread that is mid-_native_send_traces. Add a per-transport mutex that serializes sends and is held across the fork. send_traces wraps only the native call in the mutex. The :before hook locks the mutex, blocking until any in-flight send drains before _native_before_fork runs; :parent and :child release it (guarded by owned? and run from an ensure so a failed reset can't leave it locked).
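The locking scheme can be sketched in plain Ruby as follows. `SketchTransport` and its method names are hypothetical, and the native calls are replaced by placeholders; the sketch only shows the mutex discipline: sends run under the mutex, the :before hook takes it, and the :parent/:child hook releases it from an `ensure`, guarded by `owned?`.

```ruby
# Hypothetical sketch: sends are serialized by a mutex that the fork hooks
# hold across the fork.
class SketchTransport
  def initialize
    @send_mutex = Mutex.new
  end

  # Only the (simulated) native call runs under the mutex.
  def send_traces(payload)
    @send_mutex.synchronize { native_send(payload) }
  end

  # :before hook: blocks until any in-flight send has drained, then would
  # run the native before-fork step while still holding the lock.
  def before_fork
    @send_mutex.lock
  end

  # :parent / :child hook: reset the runtime, then always release the lock.
  # It runs on the thread that called before_fork, so owned? is true there.
  def after_fork
    reset_runtime
  ensure
    @send_mutex.unlock if @send_mutex.owned?
  end

  def send_locked?
    @send_mutex.locked?
  end

  private

  def native_send(payload)
    sleep(0.01) # stands in for the GVL-released native call
    payload.size
  end

  def reset_runtime
    # placeholder for the native after-fork reset
  end
end
```

The `owned?` guard makes the release hook safe to run even when the :before hook never acquired the lock, and the `ensure` keeps a failed reset from leaving sends blocked forever.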
Add a delaying mock agent that signals when a request arrives, then waits before replying, keeping a send in-flight. The test starts a send on a background thread, waits for the agent to confirm the connection, then forks through the real AtForkMonkeyPatch path. It asserts the fork blocks until the in-flight send drains, the child sends successfully and exits zero, the parent send completes without error, and the parent transport still works afterwards.
The native transport registers process-global AtForkMonkeyPatch hooks that capture the exporter and are never deregistered, so the exporter and its Rust runtime threads stay alive after the transport is dropped. When a later native spec forks a mock agent while such a leaked exporter is alive, the child inherits a runtime whose worker threads did not survive the fork; freeing it on the child's exit deadlocks, so the child never dies and the parent's Process.wait hangs (a seed-dependent multi-minute hang in the combined suite). Snapshot and restore the global at_fork registry around every native-transport example group and GC afterwards, so an exporter kept alive only by removed hooks is freed in the parent before the next group forks.
Make at_fork return the registered block so callers can hold a handle, and add remove_at_fork(stage, block) to deregister it. Removing an absent block is a no-op; an unknown stage raises ArgumentError, matching the at_fork contract.
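The contract described above (return the block as a handle, no-op on an absent block, ArgumentError on an unknown stage) might look like this in a simplified registry. `RemovableForkHooks` is a hypothetical stand-in, not the actual AtForkMonkeyPatch code.

```ruby
# Hypothetical registry illustrating the at_fork / remove_at_fork contract.
module RemovableForkHooks
  STAGES = %i[before parent child].freeze
  BLOCKS = STAGES.to_h { |stage| [stage, []] }

  # Registers the block and returns it so the caller can hold a handle.
  def self.at_fork(stage, &block)
    blocks_for(stage) << block
    block
  end

  # Removes exactly that block; removing an absent block is a no-op.
  def self.remove_at_fork(stage, block)
    blocks_for(stage).reject! { |registered| registered.equal?(block) }
    nil
  end

  def self.run_at_fork_blocks(stage)
    blocks_for(stage).each(&:call)
  end

  # Unknown stages raise ArgumentError, for registration and removal alike.
  def self.blocks_for(stage)
    BLOCKS.fetch(stage) { raise ArgumentError, "Unsupported stage #{stage}" }
  end
  private_class_method :blocks_for
end
```

Matching by identity (`equal?`) means a caller can only remove a hook it actually holds the handle for.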
The native Transport registered three process-global at_fork hooks with no way to remove them, pinning the exporter (and its Rust runtime worker threads) alive forever and accumulating hooks across reconfiguration so every fork ran them against all historically-created exporters. Capture the at_fork handles and add #close to remove all three and drop the exporter reference (idempotent). Define a finalizer, built by a class method that captures only the hook handles (never self, so the Transport stays GC-eligible), as a fallback for dropped transports.
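A rough sketch of the close-plus-finalizer pattern, under stated assumptions: `HookRegistry` and `ClosableTransport` are hypothetical, the exporter method names inside the hooks are placeholders, and building the hook blocks in a class method (so they close over the exporter rather than the transport instance) is a choice made for this sketch, not something the commit message confirms.

```ruby
# Hypothetical minimal registry standing in for AtForkMonkeyPatch.
module HookRegistry
  BLOCKS = { before: [], parent: [], child: [] }

  def self.at_fork(stage, &block)
    BLOCKS.fetch(stage) << block
    block
  end

  def self.remove_at_fork(stage, block)
    BLOCKS.fetch(stage).reject! { |registered| registered.equal?(block) }
    nil
  end
end

class ClosableTransport
  # Built at class level so the blocks close over `exporter` only.
  def self.register_hooks(exporter)
    {
      before: HookRegistry.at_fork(:before) { exporter.before_fork },
      parent: HookRegistry.at_fork(:parent) { exporter.after_fork_in_parent },
      child: HookRegistry.at_fork(:child) { exporter.after_fork_in_child }
    }
  end

  def self.deregister_hooks(handles)
    handles.each { |stage, block| HookRegistry.remove_at_fork(stage, block) }
  end

  # Fallback for transports dropped without #close. The proc references the
  # handles, never the transport instance.
  def self.finalizer(handles)
    proc { deregister_hooks(handles) }
  end

  def initialize(exporter)
    @exporter = exporter
    @handles = self.class.register_hooks(exporter)
    ObjectSpace.define_finalizer(self, self.class.finalizer(@handles))
  end

  # Idempotent: removes all three hooks and drops the exporter reference.
  def close
    return if @handles.nil?

    self.class.deregister_hooks(@handles)
    @handles = nil
    @exporter = nil
  end

  def closed?
    @handles.nil?
  end
end
```

Calling `#close` is the deterministic path; the finalizer only matters for transports that were dropped without it.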
Wire Writer#stop and SyncWriter#stop to call transport.close when the transport responds to it. Stopping a writer is a permanent teardown (once stopped it refuses to restart, and forks reuse the same transport via #start rather than stopping it), so this is the correct point to deterministically release transports that hold native resources -- the native trace exporter's Rust runtime and its process-global at_fork hooks -- instead of waiting on the GC finalizer. The guard leaves the default HTTP transport (no #close) untouched, and #close is idempotent so repeated stops are safe.
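The guard can be illustrated with a small stand-in writer. `SketchWriter` is hypothetical and omits everything a real writer does besides the teardown step.

```ruby
# Hypothetical writer showing the respond_to? guard around transport.close.
class SketchWriter
  def initialize(transport)
    @transport = transport
    @stopped = false
  end

  # Transports without #close (such as a plain HTTP transport in this
  # sketch) are left untouched; closable ones are released here.
  def stop
    @transport.close if @transport.respond_to?(:close)
    @stopped = true
  end

  def stopped?
    @stopped
  end
end
```

Since the sketch calls `close` on every `stop`, it relies on `close` being idempotent, as the commit message states.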
Several native transport specs constructed a Transport (or a Writer backed by one) and never released it, so its native exporter -- and the long-lived Rust/tokio runtime it owns -- stayed reachable until interpreter exit. Freeing such an exporter at VM shutdown, after a real fork has happened earlier in the suite, deadlocks inside libdatadog's runtime teardown, hanging the combined native spec suite at process exit. The hang is seed-dependent because it needs a forking spec and a leaked exporter to coexist. Dropping the example references and running the existing per-group GC is not enough: each Transport also has an ObjectSpace finalizer whose captured fork-hook closures keep the exporter alive, and RSpec holds the example instance (and thus the Transport) reachable for the rest of the run, so the finalizer never fires. Add a NativeTransportForkIsolation.dispose helper that calls #close on a transport (deregistering its global fork hooks and dropping its exporter reference) and undefines its now-redundant finalizer, releasing that last reference so the exporter is collected in the parent during the run. Dispose of every transport built by the transport, fork, conformance, and configuration specs. No native exporter then survives to interpreter exit.
The spec references `Datadog::Tracing::Transport::Native::InternalErrorResponse` from a second describe block, but only required the native transport from inside an earlier block, so running the file in isolation raised `uninitialized constant InternalErrorResponse` depending on example order. Require it at file scope, matching the sibling native specs.
Pin libdatadog to ~> 36.0.0.1.0 in the gemspec and extconf helper, and refresh all gemfiles/*.lock (constraint, resolved version, per-platform specs) plus the CHECKSUMS sha256 entries in gemfiles/ruby_4.0_http6.gemfile.lock. libdatadog 36.0.0.1.0 ships the trace exporter fork-safety FFI (ddog_trace_exporter_cancel_token_* and ddog_shared_runtime_*) that the native transport's fork handling depends on.
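To make the effect of the pessimistic constraint concrete, the snippet below checks a few version strings against it with RubyGems' own classes. The candidate versions other than 36.0.0.1.0 are made up for illustration and do not refer to actual libdatadog releases.

```ruby
require "rubygems"

# The constraint from the commit message above.
requirement = Gem::Requirement.new("~> 36.0.0.1.0")

# Illustrative candidates; only the first is a version named in this PR.
pinned     = Gem::Version.new("36.0.0.1.0")
last_bump  = Gem::Version.new("36.0.0.1.1")
next_build = Gem::Version.new("36.0.0.2.0")

accepts_pinned     = requirement.satisfied_by?(pinned)
accepts_last_bump  = requirement.satisfied_by?(last_bump)
accepts_next_build = requirement.satisfied_by?(next_build)
```

With five segments, `~>` allows only the last segment to float, so the constraint admits 36.0.0.1.x and rejects 36.0.0.2.0 and later.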
Benchmarks
Benchmark execution time: 2026-06-23 08:59:07
Comparing candidate commit 5a66cab in PR branch
Found 0 performance improvements and 0 performance regressions! Performance is the same for 48 metrics, 1 unstable metrics.
What does this PR do?
Add fork safety and cooperative request cancellation to the native trace exporter C extension:
- Fork safety native methods: `_native_before_fork`, `_native_after_fork_in_parent`, `_native_after_fork_in_child` on `TraceExporter`, delegating to libdatadog's `SharedRuntime` fork hooks.
- Cooperative cancellation: Replace `RUBY_UBF_IO` with a per-send cancellation token. When Ruby interrupts the thread (shutdown, `Thread#kill`), the custom UBF cancels the token, which cooperatively aborts the in-flight HTTP request in the Rust runtime. This replaces the signal-based approach which could not actually cancel the Rust HTTP pipeline.
- AtForkMonkeyPatch wiring: Register a `:child` callback in `Transport::Native::Transport#initialize` that calls `_native_after_fork_in_child` to recreate the tokio runtime in forked child processes (Puma, Unicorn, Passenger).
Motivation:
FUP to #5690. Companion to DataDog/libdatadog#2051 which adds the FFI surface this
extension calls.
Without fork hooks, the Rust tokio runtime is dead in child processes after fork, and subsequent send calls would hang or fail. Without cooperative cancellation, `Thread#kill` or Ruby shutdown during a send could leave the HTTP request running in the background.
Change log entry
None (not yet wired into the default tracer transport; no user-visible change).
Additional Notes:
`AtForkMonkeyPatch` only supports the `:child` stage; `before_fork` and `after_fork_in_parent` are exposed as native methods but not wired into automatic callbacks. The child path is the critical one: `SharedRuntime::after_fork_child` creates a fresh runtime regardless of whether the parent was prepared.
the GVL-released function and the UBF.
AI was used to accelerate implementation; all code was reviewed and understood.
How to test the change?
48 native exporter specs pass end-to-end. Fork safety and cancellation will need
integration tests with actual forking (future work).