Skip to content

Improve native trace exporter#5817

Open
lloeki wants to merge 25 commits into
masterfrom
lloeki/native-trace-exporter-improved
Open

Improve native trace exporter#5817
lloeki wants to merge 25 commits into
masterfrom
lloeki/native-trace-exporter-improved

Conversation

@lloeki

@lloeki lloeki commented May 27, 2026

Copy link
Copy Markdown
Member

What does this PR do?

Brings the native trace exporter C extension (introduced in #5690) up to the
current libdatadog main FFI. Net effect: the response body is now extracted via
ddog_trace_exporter_response_get_body returning a ddog_ByteSlice (instead of
the older raw pointer + out-length pair).

The branch history also records an exploration that was tried and reverted:
construction functions were briefly switched to ddog_MaybeError and then
restored to ddog_TraceExporterError, because upstream kept ExporterError
(its code is needed to classify HTTP responses). The commits are kept rather
than squashed to preserve that review trail.

Motivation:

#5690 is an intentional "historical intermediate point" that compiles against an
earlier libdatadog FFI. This PR is the focused diff that adapts it to the merged
main API, kept as a separate PR (rather than folded into #5690) to preserve the
review history and cross-linking.

Part of the stack: #5690 -> #5817 -> #5699 -> #5835. A separate change
(#5859) bumps the libdatadog gem to 35.0.0, which is the released version that
actually carries this ByteSlice API.

Change log entry

None.

Additional Notes:

AI was used to accelerate implementation; all code was reviewed and understood.

How to test the change?

Covered by the native transport specs introduced in the stack
(spec/datadog/tracing/transport/native/). Building the C extension against a
libdatadog that exposes the ddog_ByteSlice get_body API exercises the change;
CI runs the suite once the gem is bumped to 35.0.0 (#5859).

@lloeki lloeki changed the base branch from master to lloeki/native-trace-exporter May 27, 2026 13:55
@datadog-prod-us1-5

datadog-prod-us1-5 Bot commented May 27, 2026

Copy link
Copy Markdown

Pipelines  Tests

Fix all issues with BitsAI

⚠️ Warnings

🚦 7 Pipeline jobs failed

DataDog/apm-reliability/dd-trace-rb | ruby-app-deployment-mode.amd64.DOA9: [public.ecr.aws/lts/ubuntu:22.04, linux/amd64, 3.1.7]   View in Datadog   GitLab

DataDog/apm-reliability/dd-trace-rb | ruby-app-deployment-mode.amd64.DOC: [public.ecr.aws/lts/ubuntu:22.04, linux/amd64, 3.1.7]   View in Datadog   GitLab

DataDog/apm-reliability/dd-trace-rb | ruby-app-deployment-mode.arm64.DOA9: [public.ecr.aws/lts/ubuntu:22.04, linux/arm64, 3.1.7]   View in Datadog   GitLab

View all 7 failed jobs.

ℹ️ Info

No other issues found (see more)

🧪 All tests passed
❄️ No new flaky tests detected

🎯 Code Coverage (details)
Patch Coverage: 48.15%
Overall Coverage: 89.84% (-0.02%)

Useful? React with 👍 / 👎

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 128a133 | Docs | Datadog PR Page | Give us feedback!

@dd-octo-sts

dd-octo-sts Bot commented May 27, 2026

Copy link
Copy Markdown
Contributor

Thank you for updating Change log entry section 👏

Visited at: 2026-06-04 08:12:23 UTC

@lloeki lloeki force-pushed the lloeki/native-trace-exporter branch from d5fa393 to bd141e3 Compare May 28, 2026 11:51
@lloeki lloeki force-pushed the lloeki/native-trace-exporter-improved branch from b49856e to 2b00aa3 Compare May 28, 2026 11:51
@lloeki lloeki marked this pull request as ready for review May 29, 2026 12:16
@lloeki lloeki requested review from a team as code owners May 29, 2026 12:16

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bf8ee694ff

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread ext/libdatadog_api/trace_exporter.c
@lloeki lloeki force-pushed the lloeki/native-trace-exporter branch from d8f4277 to 8bf4660 Compare June 17, 2026 12:07
@lloeki lloeki force-pushed the lloeki/native-trace-exporter-improved branch from bf8ee69 to 444d0e5 Compare June 17, 2026 12:07
lloeki added 19 commits June 17, 2026 14:53
Introduce the `Datadog::Tracing::Transport::Native::TracerSpan` class
backed by libdatadog's `ddog_TracerSpan` FFI.

`TracerSpan._native_from_span(span)` reads instance variables directly
from a `Datadog::Tracing::Span` (`@name`, `@service`, `@meta`, etc.)
and constructs a Rust-heap-allocated span via the C FFI, including:

- Scalar fields (name, service, resource, type, IDs, timestamps)
- 128-bit trace ID split via `trace_id_t` struct into (low, high)
  64-bit halves
- String tags (meta) and numeric tags (metrics) via safe hash
  iteration that stashes errors instead of raising inside
  `rb_hash_foreach`

This is the first part of the trace exporter C extension; `TraceExporter`
and `Response` classes will follow.
Introduce the `Datadog::Tracing::Transport::Native::TraceExporter`
class backed by libdatadog's `ddog_TraceExporter` FFI.

`TraceExporter._native_new(url, tracer_version, language, ...)` builds
a Rust `TraceExporter` through the config/build pattern:

- Validates argument types upfront (before allocating Rust resources)
- Creates a `ddog_TraceExporterConfig`, applies each setting via
  `SET_CONFIG` macro with cleanup-on-error
- Builds the exporter and wraps it in Ruby `TypedData`

The `TraceExporter` is freed via `ddog_trace_exporter_free` on GC.
`Response` and `_native_send_traces` will follow.
Introduce `Datadog::Tracing::Transport::Native::Response` with the
full set of predicate methods expected by the `Writer`:

- `ok?`, `internal_error?`, `server_error?`, `client_error?`,
  `not_found?`, `unsupported?`
- `trace_count` and `payload` (agent response body for sampling
  rate feedback)

Two C helpers build responses from the native side:

- `create_ok_response`: success with optional payload
- `create_error_response`: classifies `ddog_TraceExporterErrorCode`
  into client/server/internal error categories

These will be used by `_native_send_traces` in the next step.
Implement `TraceExporter#_native_send_traces(traces)` which takes an
`Array<Array<Span>>`, converts each span to Rust via the C FFI,
groups them into trace chunks, and sends them to the Datadog Agent.

Key safety patterns:

- `rb_ensure` guarantees Rust-allocated trace chunks are freed if a
  Ruby exception fires during span conversion
- `rb_thread_call_without_gvl2` releases the GVL during the blocking
  network send so other Ruby threads can run
- The "gvl2" variant is used (not plain `gvl`) to prevent automatic
  interrupt raising before Rust response/error objects are inspected
  and freed
- A retry loop handles the case where an interrupt causes `gvl2` to
  return before the send function actually runs

The agent's response body (containing `rate_by_service` for sampling
feedback) is extracted and surfaced via `Response#payload`.

This completes the C extension for the native trace transport.
Classes created with `rb_define_class_under` are attached to their
parent module's constant table, making them reachable by GC.
Marking them as GC roots with `rb_global_variable` is unnecessary.
Delegate to the existing helper instead of duplicating the
`RSTRING_PTR`/`RSTRING_LEN` construction and type check inline.
Use a typed function pointer and a `static inline` function instead of
a preprocessor macro. Same codegen, easier to debug and type-check.
Adopt the same pattern used by the profiling extension: a
`process_pending_interruptions` callback paired with a
`check_if_pending_exception` wrapper that returns the pending state
as an integer instead of requiring manual `rb_protect` at the call
site.
Replace the `ddog_TraceExporterResponse **response_out` double pointer
with a `ddog_TraceExporterResponse *response` field and pass its
address to the FFI call. This removes one level of indirection and
keeps all send-related state within the args struct.
Use a descriptive variable name for the cast data pointer.
Move error handling into `send_chunks_without_gvl`: extract the error
code enum and free the `ddog_TraceExporterError` before returning.
After reacquiring the GVL only the plain `error_code` and `failed`
flag remain, eliminating the need to free Rust-allocated error objects
on every code path.
Accept keyword arguments via `rb_scan_args(argc, argv, "0:", ...)`
instead of 9 positional VALUE parameters. This makes call sites
self-documenting and prevents silent argument ordering mistakes.

The Ruby call becomes:
  TraceExporter._native_new(url:, tracer_version:, language:, ...)
Define Response in Ruby with keyword-argument `initialize` and
`attr_reader` accessors. The C extension loads the class at init time
via `rb_require` and calls `Response.new(ok:, ...)` through
`rb_funcallv_kw` instead of manually setting instance variables.

This gives Ruby's JIT better visibility into the object shape and
reduces the amount of C code.
Count entries skipped due to unexpected types during hash iteration
and log a warning after the loop completes. Previously these entries
were silently dropped, making it difficult to diagnose missing tags.
Replace two `rb_funcall` calls (`to_i` + `nsec`) with a single
`rb_time_timespec` call that extracts the underlying `struct timespec`
directly from the Ruby Time object. Zero method dispatch overhead.
Replace two `rb_funcall` calls (`&` and `>>`) with direct extraction
via `rb_big_pack`, which copies the Bignum magnitude into a word
buffer without method dispatch or temporary object allocation.

A Fixnum fast path handles the common case of 64-bit trace IDs with
a simple `FIX2LONG` cast.
Construct a `ddog_TracerSpanFields` struct with designated initializers
and pass it by reference to `ddog_tracer_span_new`. This matches the
updated libdatadog FFI signature and is self-documenting at the call
site.
Hoist `span_count` before `ddog_tracer_trace_chunks_begin_chunk` and
pass it as the capacity argument to pre-allocate the inner span
vector.
The error path is unreachable in practice (the handle is always
valid), but the return value must still be freed to avoid leaking
the Rust-allocated error object.
@lloeki lloeki force-pushed the lloeki/native-trace-exporter-improved branch from 444d0e5 to dc44d9b Compare June 17, 2026 12:54
@lloeki lloeki force-pushed the lloeki/native-trace-exporter branch from 8bf4660 to 44ba8ad Compare June 17, 2026 12:54
@lloeki lloeki changed the base branch from lloeki/native-trace-exporter to master June 17, 2026 15:07
@lloeki lloeki requested review from a team as code owners June 17, 2026 15:07
@lloeki lloeki requested review from mabdinur and wantsui June 17, 2026 15:07
create_ok_response/create_error_response build the Response object from C
with keyword arguments via rb_funcallv_kw + RB_PASS_KEYWORDS, both of
which were introduced in Ruby 2.7. On Ruby 2.5/2.6 the extension failed
to compile (implicit declaration of rb_funcallv_kw, RB_PASS_KEYWORDS
undeclared), fatal under -Werror-implicit-function-declaration. Probe for
the function in extconf.rb and, when absent, pass the kwargs hash as a
trailing positional argument, which pre-2.7 Ruby treats as keywords.
@lloeki lloeki force-pushed the lloeki/native-trace-exporter-improved branch from dc44d9b to 0d9c573 Compare June 17, 2026 15:32
@dd-octo-sts dd-octo-sts Bot added the tracing label Jun 17, 2026
@lloeki lloeki force-pushed the lloeki/native-trace-exporter-improved branch from 0d9c573 to 20a4b5d Compare June 17, 2026 15:51
@pr-commenter

pr-commenter Bot commented Jun 17, 2026

Copy link
Copy Markdown

Benchmarks

Benchmark execution time: 2026-06-17 17:43:53

Comparing candidate commit 128a133 in PR branch lloeki/native-trace-exporter-improved with baseline commit d3e9f45 in branch master.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 48 metrics, 1 unstable metrics.

Explanation

This is an A/B test comparing a candidate commit's performance against that of a baseline commit. Performance changes are noted in the tables below as:

  • 🟩 = significantly better candidate vs. baseline
  • 🟥 = significantly worse candidate vs. baseline

We compute a confidence interval (CI) over the relative difference of means between metrics from the candidate and baseline commits, considering the baseline as the reference.

If the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD), the change is considered significant.

Feel free to reach out to #apm-benchmarking-platform on Slack if you have any questions.

More details about the CI and significant changes

You can imagine this CI as a range of values that is likely to contain the true difference of means between the candidate and baseline commits.

CIs of the difference of means are often centered around 0%, because often changes are not that big:

---------------------------------(------|---^--------)-------------------------------->
                              -0.6%    0%  0.3%     +1.2%
                                 |          |        |
         lower bound of the CI --'          |        |
sample mean (center of the CI) -------------'        |
         upper bound of the CI ----------------------'

As described above, a change is considered significant if the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD).

For instance, for an execution time metric, this confidence interval indicates a significantly worse performance:

----------------------------------------|---------|---(---------^---------)---------->
                                       0%        1%  1.3%      2.2%      3.1%
                                                  |   |         |         |
       significant impact threshold --------------'   |         |         |
                      lower bound of CI --------------'         |         |
       sample mean (center of the CI) --------------------------'         |
                      upper bound of CI ----------------------------------'

lloeki added 4 commits June 17, 2026 19:17
The native trace exporter / transport specs need the libdatadog_api
extension, so they can't run in spec:main (no compile). Put them in a
dedicated NATIVE_TRANSPORT_SPECS group and a new spec:native_transport
task (enhance([:compile])): excluded from spec:main, listed in spec:all,
and added to the Matrixfile.

Keep them OUT of core_with_libdatadog_api: that pool also runs the
fork-heavy ProcessDiscovery specs, and a live native exporter
(SharedRuntime threads) present when ProcessDiscovery forks deadlocks the
child until fork safety lands (#5835). A separate task runs them in their
own process.
Switch from `ddog_TraceExporterError *` to `ddog_MaybeError` for all
tracer construction functions (`span_new`, `set_meta`, `set_metric`,
`trace_chunks_new`, `begin_chunk`, `push_span`). Use the existing
`read_ddogerr_string_and_drop` helper via a new `check_maybe_error`
inline function.

`send_trace_chunks` retains `ddog_TraceExporterError` because callers
need the error code for HTTP response classification.
Replace the separate `*const uint8_t` return + `out_len` out-parameter
with a single `ddog_ByteSlice` struct, matching the updated
libdatadog FFI signature.
The tracer construction functions return `ddog_TraceExporterError *`
again, matching the libdatadog FFI. Restore `check_exporter_error`
usage at the call sites and drop the `ddog_MaybeError`-based handling.

This reverts the consumer-side `ddog_MaybeError` adaptation: the
structured `ExporterError` type with error codes is retained upstream
instead.
@lloeki lloeki force-pushed the lloeki/native-trace-exporter-improved branch from 20a4b5d to 128a133 Compare June 17, 2026 17:18
@wantsui wantsui requested a review from marcotc June 22, 2026 21:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant