Skip to content

Skip per-tick sampling of profiler-internal threads#5955

Open
eregon wants to merge 5 commits into
masterfrom
dont-sample-profiler-internal-threads
Open

Skip per-tick sampling of profiler-internal threads#5955
eregon wants to merge 5 commits into
masterfrom
dont-sample-profiler-internal-threads

Conversation

@eregon

@eregon eregon commented Jun 25, 2026

Copy link
Copy Markdown
Member

CpuAndWallTimeWorker and IdleSamplingHelper threads are now sampled only once per profiling period (via the on-serialize flush) instead of every 10ms tick, reducing profiling overhead.

What does this PR do?
^

Motivation:
Reduce profiler overhead

Change log entry

Yes, Changed: Profiling: Reduce profiler overhead by sampling profiler-internal threads only once per minute.

Additional Notes:

How to test the change?

@dd-octo-sts dd-octo-sts Bot added the profiling Involves Datadog profiling label Jun 25, 2026
@datadog-datadog-prod-us1-2

datadog-datadog-prod-us1-2 Bot commented Jun 25, 2026

Copy link
Copy Markdown

Pipelines  Tests

Fix all issues with BitsAI

⚠️ Warnings

🚦 2 Pipeline jobs failed

Test for memory leaks | test-memcheck   View in Datadog   GitHub Actions

Check Pull Request CI Status | all-jobs-are-green   View in Datadog   GitHub Actions

❄️ 1 New flaky test detected

Datadog::Profiling::Collectors::ThreadContext profiler-internal thread skipping still records one sample per profile period via serialize from rspec   View in Datadog
expected: 1
     got: 0

(compared using ==)

Failure/Error: expect(t1_samples.size).to eq(1)

  expected: 1
       got: 0

...

New test introduced in this PR is flaky.

View in Flaky Test Management

ℹ️ Info

No other issues found (see more)

🧪 All tests passed

🎯 Code Coverage (details)
Patch Coverage: 100.00%
Overall Coverage: 90.01% (-0.03%)

Useful? React with 👍 / 👎

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: ce3c80f | Docs | Datadog PR Page | Give us feedback!

@eregon eregon force-pushed the dont-sample-profiler-internal-threads branch 2 times, most recently from 6656528 to 3b3cf59 Compare June 25, 2026 14:15
@eregon

eregon commented Jun 25, 2026

Copy link
Copy Markdown
Member Author

In the benchmarks/profiling_sample_overhead.rb benchmark we still have ~1000 samples for the profiling ovearhead (Datadog::Profiling::Sampling, sampling), even though we didn't sample any thread, but it still took CPU time (~40µs).
What should we do about those? Maybe we should group them by reporting period too to reduce the number of samples?

@eregon eregon force-pushed the dont-sample-profiler-internal-threads branch 2 times, most recently from 10efbe0 to b3d4ae0 Compare June 25, 2026 14:30
@eregon eregon marked this pull request as ready for review June 25, 2026 14:31
@eregon eregon requested review from a team as code owners June 25, 2026 14:31
@eregon

eregon commented Jun 25, 2026

Copy link
Copy Markdown
Member Author

Before:

{duration: 10.02499,
 samples: 998,
 trigger_simulated_signal_delivery_attempts: 996,
 cpu_sampling_time_ns_total: 30466000,
 cpu_sampling_overhead: "0.30%",
 serialization_time_ns_total: 4696000,
 serialization_overhead: "0.05%",
 sampling_rate: 99.5512214974778,
 sample_every_n_ms: 10.045080160320643,
 inactive_thread_samples_skipped: 3983}

After:

{duration: 10.020218,
 samples: 998,
 trigger_simulated_signal_delivery_attempts: 997,
 cpu_sampling_time_ns_total: 24977000,
 cpu_sampling_overhead: "0.25%",
 serialization_time_ns_total: 2432000,
 serialization_overhead: "0.02%",
 sampling_rate: 99.59863148685987,
 sample_every_n_ms: 10.04029859719439,
 inactive_thread_samples_skipped: 2988,
 profiler_thread_samples_skipped: 1996}

cpu_sampling_overhead and serialization_overhead drop and all regular samples are skipped for the 2 profiler-internal threads.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b3d4ae0f42

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread ext/datadog_profiling_native_extension/collectors_thread_context.c
// situation we stop immediately and never even start the sampling trigger loop.
if (state->stop_thread == rb_thread_current()) return Qnil;

thread_context_collector_mark_current_thread_as_profiler_internal();

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid charging sampler sleep as overhead

When no_signals_workaround_enabled is true, run_sampling_trigger_loop calls grab_gvl_and_sample() from this same worker thread, making CpuAndWallTimeWorker the current_thread in thread_context_collector_sample. Because this line now marks that current thread as internal, the new skip path returns before updating its previous-sample timestamps, and the immediately following record_sampling_overhead(state, current_thread_context) consumes the whole interval since the last tick as profiler overhead; no-signal profiles will therefore attribute the worker's sleep/wait time to overhead instead of recording one internal-thread sample per period.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alas, this was a nice and simple change but it won't be so simple anymore 😓

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 4909597 and e2c83f2.
It's much simpler if we are OK to attribute sampling overhead to the profiler-internal threads in the case they are the current thread for thread_context_collector_sample().

CpuAndWallTimeWorker and IdleSamplingHelper threads are now sampled
only once per profiling period (via the on-serialize flush) instead
of every 10ms tick, reducing profiling overhead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@pr-commenter

pr-commenter Bot commented Jun 25, 2026

Copy link
Copy Markdown

Benchmarks

Benchmark execution time: 2026-06-25 15:49:45

Comparing candidate commit e2c83f2 in PR branch dont-sample-profiler-internal-threads with baseline commit 75ccdd1 in branch master.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 48 metrics, 1 unstable metrics.

Explanation

This is an A/B test comparing a candidate commit's performance against that of a baseline commit. Performance changes are noted in the tables below as:

  • 🟩 = significantly better candidate vs. baseline
  • 🟥 = significantly worse candidate vs. baseline

We compute a confidence interval (CI) over the relative difference of means between metrics from the candidate and baseline commits, considering the baseline as the reference.

If the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD), the change is considered significant.

Feel free to reach out to #apm-benchmarking-platform on Slack if you have any questions.

More details about the CI and significant changes

You can imagine this CI as a range of values that is likely to contain the true difference of means between the candidate and baseline commits.

CIs of the difference of means are often centered around 0%, because often changes are not that big:

---------------------------------(------|---^--------)-------------------------------->
                              -0.6%    0%  0.3%     +1.2%
                                 |          |        |
         lower bound of the CI --'          |        |
sample mean (center of the CI) -------------'        |
         upper bound of the CI ----------------------'

As described above, a change is considered significant if the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD).

For instance, for an execution time metric, this confidence interval indicates a significantly worse performance:

----------------------------------------|---------|---(---------^---------)---------->
                                       0%        1%  1.3%      2.2%      3.1%
                                                  |   |         |         |
       significant impact threshold --------------'   |         |         |
                      lower bound of CI --------------'         |         |
       sample mean (center of the CI) --------------------------'         |
                      upper bound of CI ----------------------------------'

In no-signals workaround mode, the worker thread itself runs sampling
via grab_gvl_and_sample. Since the worker is now skipped in the sample
loop (timestamps not updated), record_sampling_overhead would attribute
the entire sleep interval as overhead. Fix by temporarily setting
timestamps to sample-start, then restoring them so on_serialize can
still accumulate the full reporting period.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@eregon eregon force-pushed the dont-sample-profiler-internal-threads branch from b3d4ae0 to 4909597 Compare June 25, 2026 15:09
eregon and others added 2 commits June 25, 2026 17:22
…iler-internal threads

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…mples

on_serialize now force-samples profiler-internal threads, which could
satisfy the samples.any? check before any signal-triggered sample ran.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

profiling Involves Datadog profiling

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant