[NO-TICKET] Profiling: Reset leftover per-thread state when creating new ThreadContext by ivoanjo · Pull Request #5926 · DataDog/dd-trace-rb

ivoanjo · 2026-06-22T13:58:35Z

What does this PR do?

Since #5816, we've been keeping the profiler's per-thread context directly attached to each Ruby thread.

This PR fixes a flaky spec that hit
master:

Failures:

  1) Datadog::Profiling::Collectors::ThreadContext#sample_after_gvl_running if thread has not been sampled before does not sample the thread
     Failure/Error: expect(samples).to be_empty
       expected `[#<struct ProfileHelpers::Sample locations=[#<struct ProfileHelpers::Frame base_label="sleep", path="...26761, :state=>"sleeping", :"thread id"=>"14823 (39460)", :"thread name"=>"Timeout stdlib thread"}>].empty?` to be truthy, got false
     # ./spec/datadog/profiling/collectors/thread_context_spec.rb:1934:in `block (4 levels) in <top (required)>'
     # ./spec/spec_helper.rb:327:in `block (2 levels) in <top (required)>'
     # ./spec/spec_helper.rb:207:in `block (2 levels) in <top (required)>'
     # /usr/local/bundle/gems/webmock-3.26.2/lib/webmock/rspec.rb:39:in `block (2 levels) in <top (required)>'
     # /usr/local/bundle/gems/rspec-wait-0.0.10/lib/rspec/wait.rb:47:in `block (2 levels) in <top (required)>'
     # ./spec/support/execute_in_fork.rb:32:in `run'

Finished in 27.48 seconds (files took 1.09 seconds to load)
806 examples, 1 failure, 15 pending

Failed examples:

rspec ./spec/datadog/profiling/collectors/thread_context_spec.rb:1931 # Datadog::Profiling::Collectors::ThreadContext#sample_after_gvl_running if thread has not been sampled before does not sample the thread

This spec failed because we didn't account for the background timeout thread in our assertions, and thus we saw an extra thread that failed the assertion.

Specifically, what I believe happened is that the background thread got a was_skipped_at_last_sample in a previous test, and that carried over to this test that assumed only one thread (t1) was in such a pending state.

By resetting the state for a new instance of the ThreadContext, this issue will no longer happen.

Motivation:

Fix flaky test and what I argue is also a logic issue (see below).

Change log entry

None. (Specifically -- the behavior being fixed was introduced in #5816 which was not shipped to customers yet, so there's no actual changelog from a fix)

Additional Notes:

When looking into this one, it dawned on me that there was a deeper problem being exposed here, which is why the solution to this problem is inside the production code in collectors_thread_context.c and not in the thread_context_spec.rb.

Specifically, all of cpu_time_at_previous_sample_ns, wall_time_at_previous_sample_ns, gvl_waiting_at, gvl_state_change_count, gvl_state_change_count_at_previous_sample, was_skipped_at_last_sample, gc_tracking.cpu_time_at_start_ns, gc_tracking.wall_time_at_start_ns carry their state across profiler restarts.

So a similar problem to the one that happened in our specs could in fact happen in production, where state from a previous profiler is used in sampling decisions on a new profiler. This is a bit of a corner case but consider something like:

Datadog.configure { |c| c.profiling.enabled = true }
Datadog::Profiling.wait_until_running
sleep 5
Datadog.configure { |c| c.profiling.enabled = false }
sleep 5
Datadog.configure { |c| c.profiling.enabled = true } # <-- profiler #2

I didn't try it, but I expect that profiler #2 would suddenly start using wall_time_at_previous_sample_ns on existing threads and assign one single sample with all of the time while the profiler was stopped; this was what effectively was happening in the test, but at a much smaller scale.

This solution fixes this by forcing the new profiler to start with a clean state.

How to test the change?

Existing coverage + the new test should be enough for this one.

…new ThreadContext **What does this PR do?** Since #5816, we've been keeping the profiler's per-thread context directly attached to each Ruby thread. This PR fixes a flaky spec that hit [master](https://github.com/DataDog/dd-trace-rb/actions/runs/27937925206/job/82664232085): ``` Failures: 1) Datadog::Profiling::Collectors::ThreadContext#sample_after_gvl_running if thread has not been sampled before does not sample the thread Failure/Error: expect(samples).to be_empty expected `[#<struct ProfileHelpers::Sample locations=[#<struct ProfileHelpers::Frame base_label="sleep", path="...26761, :state=>"sleeping", :"thread id"=>"14823 (39460)", :"thread name"=>"Timeout stdlib thread"}>].empty?` to be truthy, got false # ./spec/datadog/profiling/collectors/thread_context_spec.rb:1934:in `block (4 levels) in <top (required)>' # ./spec/spec_helper.rb:327:in `block (2 levels) in <top (required)>' # ./spec/spec_helper.rb:207:in `block (2 levels) in <top (required)>' # /usr/local/bundle/gems/webmock-3.26.2/lib/webmock/rspec.rb:39:in `block (2 levels) in <top (required)>' # /usr/local/bundle/gems/rspec-wait-0.0.10/lib/rspec/wait.rb:47:in `block (2 levels) in <top (required)>' # ./spec/support/execute_in_fork.rb:32:in `run' Finished in 27.48 seconds (files took 1.09 seconds to load) 806 examples, 1 failure, 15 pending Failed examples: rspec ./spec/datadog/profiling/collectors/thread_context_spec.rb:1931 # Datadog::Profiling::Collectors::ThreadContext#sample_after_gvl_running if thread has not been sampled before does not sample the thread ``` This spec failed because we didn't account for the background timeout thread in our assertions, and thus we saw an extra thread that failed the assertion. Specifically, what I believe happened is that the background thread got a `was_skipped_at_last_sample` in a previous test, and that carried over to this test that assumed only one thread (`t1`) was in such a pending state. By resetting the state for a new instance of the `ThreadContext`, this issue will no longer happen. **Motivation:** Fix flaky test and what I argue is also a logic issue (see below). **Additional Notes:** When looking into this one, it dawned on me that there was a deeper problem being exposed here, which is why the solution to this problem is inside the production code in `collectors_thread_context.c` and not in the `thread_context_spec.rb`. Specifically, all of `cpu_time_at_previous_sample_ns`, `wall_time_at_previous_sample_ns`, `gvl_waiting_at`, `gvl_state_change_count`, `gvl_state_change_count_at_previous_sample`, `was_skipped_at_last_sample`, `gc_tracking.cpu_time_at_start_ns`, `gc_tracking.wall_time_at_start_ns` carry their state across profiler restarts. So a similar problem to the one that happened in our specs could in fact happen in production, where state from a previous profiler is used in sampling decisions on a new profiler. This is a bit of a corner case but consider something like: ```ruby Datadog.configure { |c| c.profiling.enabled = true } Datadog::Profiling.wait_until_running sleep 5 Datadog.configure { |c| c.profiling.enabled = false } sleep 5 Datadog.configure { |c| c.profiling.enabled = true } # <-- profiler #2 ``` I didn't try it, but I expect that profiler #2 would suddenly start using `wall_time_at_previous_sample_ns` on existing threads and assign one single sample with all of the time while the profiler was stopped; this was what effectively was happening in the test, but at a much smaller scale. This solution fixes this by forcing the new profiler to start with a clean state. **How to test the change?** Existing coverage + the new test should be enough for this one.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9ee6be197f

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-06-22T14:01:58Z

+  for (long i = 0; i < thread_count; i++) {
+    VALUE thread = RARRAY_AREF(threads, i);
+    per_thread_context *thread_context = get_per_thread_context(thread);
+    if (thread_context != NULL) reset_context_state(thread_context);


Defer resetting shared thread state until old profiler stops

When profiling is reconfigured while still enabled, Datadog::Core::Configuration.replace_components! constructs the new Components before calling old.shutdown! (lib/datadog/core/configuration.rb:264-277), so this reset runs while the previous profiler worker/scheduler can still be sampling or serializing the same global per_thread_context objects. In that window, clearing timestamps, GVL state, and was_skipped_at_last_sample can make the old profiler lose accumulated wall/CPU time or emit an inconsistent final profile; reset this shared state only after the old profiler is stopped, or during the new profiler start path after shutdown.

Useful? React with 👍 / 👎.

Hmmm this is a good point, let me look into it

datadog-datadog-prod-us1 · 2026-06-22T14:15:04Z

Tests

✨ Fix all issues with BitsAI

⚠️ Warnings

🚦 12 Pipeline jobs failed

System Tests | test / End-to-end #11 / rails80 11

System Tests | test / End-to-end #21 / rails61 21

System Tests | test / End-to-end #9 / rails80 9

View all 12 failed jobs.

ℹ️ Info

No other issues found (see more)

🧪 All tests passed
❄️ No new flaky tests detected

🎯 Code Coverage (details)
• Patch Coverage: 100.00%
• Overall Coverage: 90.00% (+0.01%)

Useful? React with 👍 / 👎

_{This comment will be updated automatically if new data arrives.

🔗 Commit SHA: 9ee6be1 | Docs | Datadog PR Page | Give us feedback!}

pr-commenter · 2026-06-22T14:23:13Z

Benchmarks

Benchmark execution time: 2026-06-22 14:23:10

Comparing candidate commit 9ee6be1 in PR branch ivoanjo/fix-leftover-state-between-profilers with baseline commit d171f20 in branch master.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 48 metrics, 1 unstable metrics.

Explanation

This is an A/B test comparing a candidate commit's performance against that of a baseline commit. Performance changes are noted in the tables below as:

🟩 = significantly better candidate vs. baseline
🟥 = significantly worse candidate vs. baseline

We compute a confidence interval (CI) over the relative difference of means between metrics from the candidate and baseline commits, considering the baseline as the reference.

If the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD), the change is considered significant.

Feel free to reach out to #apm-benchmarking-platform on Slack if you have any questions.

More details about the CI and significant changes

You can imagine this CI as a range of values that is likely to contain the true difference of means between the candidate and baseline commits.

CIs of the difference of means are often centered around 0%, because often changes are not that big:

---------------------------------(------|---^--------)-------------------------------->
                              -0.6%    0%  0.3%     +1.2%
                                 |          |        |
         lower bound of the CI --'          |        |
sample mean (center of the CI) -------------'        |
         upper bound of the CI ----------------------'

As described above, a change is considered significant if the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD).

For instance, for an execution time metric, this confidence interval indicates a significantly worse performance:

----------------------------------------|---------|---(---------^---------)---------->
                                       0%        1%  1.3%      2.2%      3.1%
                                                  |   |         |         |
       significant impact threshold --------------'   |         |         |
                      lower bound of CI --------------'         |         |
       sample mean (center of the CI) --------------------------'         |
                      upper bound of CI ----------------------------------'

eregon · 2026-06-22T14:24:32Z

To me this is clearly a bug of the test:

      it "does not sample the thread" do
        sample_after_gvl_running(t1)

        expect(samples).to be_empty
      end

It's incorrect to assert samples to be empty here, there can always be background threads like the Timeout stdlib one we see in the failure.
So the test should only assert samples for t1 and nothing else.

BTW we already do testing_threads_and_current.each { |t| clear_per_thread_context_for(t) } in the spec, so I don't think this would solve it, e.g. if the Timeout thread was somehow started after the new ThreadContext is created.

In general I think resetting the state is a mistake, in production we cannot/shouldn't do it (it's quite unsafe, it's too difficult to stop the previous ThreadContext/Worker and ensure no postponed job/signal/etc trigger) so our tests shouldn't either.

ivoanjo · 2026-06-22T14:25:47Z

In general I think resetting the state is a mistake, in production we cannot/shouldn't do it (it's quite unsafe, it's too difficult to stop the previous ThreadContext/Worker and ensure no postponed job/signal/etc trigger) so our tests shouldn't either.

See my note on the PR description -- if we don't reset the state, we'll be pushing very weird samples from old profilers.

eregon · 2026-06-22T14:27:43Z

See my note on the PR description -- if we don't reset the state, we'll be pushing very weird samples from old profilers.

Yes, I saw that and I think this is no big deal. OTOH inconsistent state because of concurrent access to the per-thread state would be much worse: inconsistent state and potentially segfaults (e.g. concurrent access to sampling_buffer would be pretty bad, though this PR doesn't reset that one, but still).

eregon · 2026-06-22T14:29:30Z

IOW given the per_thread_context are global, I think we should embrace it. Fighting it is not going to work.
If we wanted a new clean per_thread_context for every new ThreadContext, then we should have picked that approach, but as we saw in #5816, it does not work (too difficult to stop the previous ThreadContext/Worker).

**What does this PR do?** This PR skips a flaky spec that hit [master](https://github.com/DataDog/dd-trace-rb/actions/runs/27937925206/job/82664232085): ``` Failures: 1) Datadog::Profiling::Collectors::ThreadContext#sample_after_gvl_running if thread has not been sampled before does not sample the thread Failure/Error: expect(samples).to be_empty expected `[#<struct ProfileHelpers::Sample locations=[#<struct ProfileHelpers::Frame base_label="sleep", path="...26761, :state=>"sleeping", :"thread id"=>"14823 (39460)", :"thread name"=>"Timeout stdlib thread"}>].empty?` to be truthy, got false # ./spec/datadog/profiling/collectors/thread_context_spec.rb:1934:in `block (4 levels) in <top (required)>' # ./spec/spec_helper.rb:327:in `block (2 levels) in <top (required)>' # ./spec/spec_helper.rb:207:in `block (2 levels) in <top (required)>' # /usr/local/bundle/gems/webmock-3.26.2/lib/webmock/rspec.rb:39:in `block (2 levels) in <top (required)>' # /usr/local/bundle/gems/rspec-wait-0.0.10/lib/rspec/wait.rb:47:in `block (2 levels) in <top (required)>' # ./spec/support/execute_in_fork.rb:32:in `run' Finished in 27.48 seconds (files took 1.09 seconds to load) 806 examples, 1 failure, 15 pending Failed examples: rspec ./spec/datadog/profiling/collectors/thread_context_spec.rb:1931 # Datadog::Profiling::Collectors::ThreadContext#sample_after_gvl_running if thread has not been sampled before does not sample the thread ``` We started fixing it in DataDog#5926 but it looks like the fix might be more involved than initially considered so let's unblock master while we figure out a good path. **Motivation:** Unblock master while we figure out a better solution **Additional Notes:** N/A **How to test the change?** CI should be green, now that this test is skipped

ivoanjo · 2026-06-24T10:59:49Z

Ok so I was missing a final sleep 1 on my example in the PR description.

With

Datadog.configure { |c| c.profiling.enabled = true } Datadog::Profiling.wait_until_running sleep 5 Datadog.configure { |c| c.profiling.enabled = false } sleep 5 Datadog.configure { |c| c.profiling.enabled = true } # <-- profiler #2 sleep 1

I got what I feared... a 1-second profile with 6 seconds of data (representing the gap while the profiler was stopped)

For the original flakiness, I could "reproduce it" with

describe "#sample_after_gvl_running" do before { skip_if_gvl_profiling_not_supported(self) } fcontext "if thread does not have per-thread context" do before { remove_per_thread_context_for(t1) } # it do # expect(sample_after_gvl_running(t1)).to be false # end it 'weird test setup' do sleep_thread = Thread.new { sleep } sample # Creates context on_gvl_released(sleep_thread) sample # Waiting starts sample # Sets skip pp per_thread_context[sleep_thread] end it "does not sample the thread" do # skip("This is flaky -- we're discussing a full fix in https://github.com/DataDog/dd-trace-rb/pull/5926 but for now let's skip") sample_after_gvl_running(t1) expect(samples).to be_empty end end

And running with --order=defined to make sure the specs ran in this order. There is one very subtle detail here -- the was_skipped_at_last_sample gets reset when the recorder is flushed so the issue in particular that made the spec flaky might not happen in practice in production since when we stop the profiler we flush the recorder.

ivoanjo requested review from a team as code owners June 22, 2026 13:58

dd-octo-sts Bot added the profiling Involves Datadog profiling label Jun 22, 2026

ivoanjo commented Jun 22, 2026

View reviewed changes

Comment thread ext/datadog_profiling_native_extension/collectors_thread_context.c

ivoanjo requested a review from eregon June 22, 2026 14:00

chatgpt-codex-connector Bot reviewed Jun 22, 2026

View reviewed changes

ivoanjo mentioned this pull request Jun 22, 2026

[NO-TICKET] Skip flaky profiling spec until a full fix #5929

Merged

ivoanjo commented Jun 26, 2026

View reviewed changes

Uh oh!

Conversation

ivoanjo commented Jun 22, 2026

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

ivoanjo Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

datadog-datadog-prod-us1 Bot commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

⚠️ Warnings

ℹ️ Info

Uh oh!

pr-commenter Bot commented Jun 22, 2026

Benchmarks

Explanation

More details about the CI and significant changes

Uh oh!

eregon commented Jun 22, 2026

Uh oh!

ivoanjo commented Jun 22, 2026

Uh oh!

eregon commented Jun 22, 2026

Uh oh!

eregon commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ivoanjo Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

ivoanjo Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

datadog-datadog-prod-us1 Bot commented Jun 22, 2026 •

edited

Loading

eregon commented Jun 22, 2026 •

edited

Loading