[NO-TICKET] Profiling: Reset leftover per-thread state when creating new ThreadContext #5926

Open
ivoanjo wants to merge 1 commit into master from ivoanjo/fix-leftover-state-between-profilers

ivoanjo commented Jun 22, 2026

Member

What does this PR do?

Since #5816, we've been keeping the profiler's per-thread context directly attached to each Ruby thread.

This PR fixes a flaky spec that hit master:

Failures:

  1) Datadog::Profiling::Collectors::ThreadContext#sample_after_gvl_running if thread has not been sampled before does not sample the thread
     Failure/Error: expect(samples).to be_empty
       expected `[#<struct ProfileHelpers::Sample locations=[#<struct ProfileHelpers::Frame base_label="sleep", path="...26761, :state=>"sleeping", :"thread id"=>"14823 (39460)", :"thread name"=>"Timeout stdlib thread"}>].empty?` to be truthy, got false
     # ./spec/datadog/profiling/collectors/thread_context_spec.rb:1934:in `block (4 levels) in <top (required)>'
     # ./spec/spec_helper.rb:327:in `block (2 levels) in <top (required)>'
     # ./spec/spec_helper.rb:207:in `block (2 levels) in <top (required)>'
     # /usr/local/bundle/gems/webmock-3.26.2/lib/webmock/rspec.rb:39:in `block (2 levels) in <top (required)>'
     # /usr/local/bundle/gems/rspec-wait-0.0.10/lib/rspec/wait.rb:47:in `block (2 levels) in <top (required)>'
     # ./spec/support/execute_in_fork.rb:32:in `run'

Finished in 27.48 seconds (files took 1.09 seconds to load)
806 examples, 1 failure, 15 pending

Failed examples:

rspec ./spec/datadog/profiling/collectors/thread_context_spec.rb:1931 # Datadog::Profiling::Collectors::ThreadContext#sample_after_gvl_running if thread has not been sampled before does not sample the thread

This spec failed because we didn't account for the background timeout thread in our assertions, and thus we saw an extra thread that failed the assertion.

Specifically, what I believe happened is that the background thread had was_skipped_at_last_sample set in a previous test, and that state carried over to this test, which assumed only one thread (t1) was in such a pending state.

By resetting the state for a new instance of the ThreadContext, this issue will no longer happen.

Motivation:

Fix a flaky test and what I argue is also a logic issue (see below).

Change log entry

None. The behavior being fixed was introduced in #5816, which has not shipped to customers yet, so the fix needs no changelog entry.

Additional Notes:

When looking into this one, it dawned on me that there was a deeper problem being exposed here, which is why the solution to this problem is inside the production code in collectors_thread_context.c and not in the thread_context_spec.rb.

Specifically, all of cpu_time_at_previous_sample_ns, wall_time_at_previous_sample_ns, gvl_waiting_at, gvl_state_change_count, gvl_state_change_count_at_previous_sample, was_skipped_at_last_sample, gc_tracking.cpu_time_at_start_ns, gc_tracking.wall_time_at_start_ns carry their state across profiler restarts.

So a similar problem to the one that happened in our specs could in fact happen in production, where state from a previous profiler is used in sampling decisions on a new profiler. This is a bit of a corner case but consider something like:

Datadog.configure { |c| c.profiling.enabled = true }
Datadog::Profiling.wait_until_running
sleep 5
Datadog.configure { |c| c.profiling.enabled = false }
sleep 5
Datadog.configure { |c| c.profiling.enabled = true } # <-- profiler #2

I didn't try it, but I expect that profiler #2 would suddenly start using wall_time_at_previous_sample_ns on existing threads and attribute all of the time the profiler was stopped to a single sample; this is effectively what was happening in the test, but at a much smaller scale.

This PR addresses that by forcing the new profiler to start with a clean state.
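The carry-over described above can be sketched with a small pure-Ruby model. Everything in it is hypothetical: `PerThreadContext`, `ToyCollector`, `SHARED_CONTEXTS`, and the injected clock are illustrative stand-ins, not the native extension's real structures. Only the field name `wall_time_at_previous_sample_ns` comes from the PR, and `reset_on_create` is a simplified take on the proposed fix.

```ruby
# Toy model: per-thread state outlives the collector that created it.
PerThreadContext = Struct.new(:wall_time_at_previous_sample_ns)

# Stand-in for state attached to each Ruby thread (shared by every collector).
SHARED_CONTEXTS = {}

class ToyCollector
  # reset_on_create models the fix proposed in the PR (simplified assumption).
  def initialize(clock:, reset_on_create:)
    @clock = clock
    return unless reset_on_create

    SHARED_CONTEXTS.each_value { |ctx| ctx.wall_time_at_previous_sample_ns = nil }
  end

  # Returns the wall time (in ns) attributed to this sample.
  def sample(thread_key)
    now = @clock.call
    ctx = (SHARED_CONTEXTS[thread_key] ||= PerThreadContext.new(nil))
    previous = ctx.wall_time_at_previous_sample_ns
    ctx.wall_time_at_previous_sample_ns = now
    previous.nil? ? 0 : now - previous
  end
end

def simulate(reset_on_create:)
  SHARED_CONTEXTS.clear
  second = 1_000_000_000
  now_ns = 0
  clock = -> { now_ns } # fake clock, advanced by hand below

  profiler1 = ToyCollector.new(clock: clock, reset_on_create: reset_on_create)
  profiler1.sample(:main)  # first sample at t=0s
  now_ns = 5 * second
  profiler1.sample(:main)  # last sample of profiler #1 at t=5s

  now_ns = 10 * second     # profiler stopped for 5 seconds
  profiler2 = ToyCollector.new(clock: clock, reset_on_create: reset_on_create)
  now_ns = 11 * second
  profiler2.sample(:main) / second # seconds attributed to profiler #2's first sample
end

puts simulate(reset_on_create: false) # 6
puts simulate(reset_on_create: true)  # 0
```

Without the reset, the first sample of the second collector covers the 5 seconds of downtime plus the 1 second it actually ran. With the reset, the first sample only establishes a new baseline; how the real collector seeds that baseline is not modeled here.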

How to test the change?

Existing coverage + the new test should be enough for this one.

@ivoanjo ivoanjo requested review from a team as code owners June 22, 2026 13:58
@dd-octo-sts dd-octo-sts Bot added the profiling (Involves Datadog profiling) label Jun 22, 2026
Comment thread ext/datadog_profiling_native_extension/collectors_thread_context.c
@ivoanjo ivoanjo requested a review from eregon June 22, 2026 14:00

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9ee6be197f

for (long i = 0; i < thread_count; i++) {
  VALUE thread = RARRAY_AREF(threads, i);
  per_thread_context *thread_context = get_per_thread_context(thread);
  // Threads without a per-thread context have nothing to reset
  if (thread_context != NULL) reset_context_state(thread_context);
}

P2: Defer resetting shared thread state until old profiler stops

When profiling is reconfigured while still enabled, Datadog::Core::Configuration.replace_components! constructs the new Components before calling old.shutdown! (lib/datadog/core/configuration.rb:264-277), so this reset runs while the previous profiler worker/scheduler can still be sampling or serializing the same global per_thread_context objects. In that window, clearing timestamps, GVL state, and was_skipped_at_last_sample can make the old profiler lose accumulated wall/CPU time or emit an inconsistent final profile; reset this shared state only after the old profiler is stopped, or during the new profiler start path after shutdown.
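The ordering concern above can be illustrated with a toy Ruby model. All names here (`Context`, `ToyProfiler`, `old_profiler_total_ns`, and the two order symbols) are hypothetical, and the model only assumes what the review comment states: the new component is constructed, and resets the shared state, while the old profiler can still take a sample.

```ruby
# Toy model of resetting shared per-thread state while the old profiler is live.
Context = Struct.new(:wall_time_at_previous_sample_ns)

class ToyProfiler
  attr_reader :recorded_ns

  def initialize(shared_context, reset_on_create:)
    @context = shared_context
    @recorded_ns = 0
    @context.wall_time_at_previous_sample_ns = nil if reset_on_create
  end

  # Accumulates the wall time elapsed since the previous sample, if there was one.
  def sample(now_ns)
    previous = @context.wall_time_at_previous_sample_ns
    @recorded_ns += now_ns - previous unless previous.nil?
    @context.wall_time_at_previous_sample_ns = now_ns
  end
end

# Returns the total wall time the OLD profiler recorded under each ordering.
def old_profiler_total_ns(order)
  context = Context.new(nil)
  old = ToyProfiler.new(context, reset_on_create: true)
  old.sample(0)
  old.sample(100)

  case order
  when :build_new_before_stopping_old
    ToyProfiler.new(context, reset_on_create: true) # reset happens in the window
    old.sample(150)                                 # old worker is still sampling
  when :stop_old_before_building_new
    old.sample(150)                                 # old profiler's last sample
    ToyProfiler.new(context, reset_on_create: true) # reset only after it stopped
  end

  old.recorded_ns
end

puts old_profiler_total_ns(:build_new_before_stopping_old) # 100
puts old_profiler_total_ns(:stop_old_before_building_new)  # 150
```

In the first ordering the old profiler loses the 50 ns between its last two samples because its baseline was cleared underneath it. The model is single-threaded, so it does not capture the data-race side of the concern.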

ivoanjo (Member, Author) replied:

Hmmm this is a good point, let me look into it

@datadog-datadog-prod-us1

datadog-datadog-prod-us1 Bot commented Jun 22, 2026

Contributor

⚠️ Warnings

🚦 12 Pipeline jobs failed

  • System Tests | test / End-to-end #11 / rails80 11
  • System Tests | test / End-to-end #21 / rails61 21
  • System Tests | test / End-to-end #9 / rails80 9

(3 of the 12 failed jobs listed.)

ℹ️ Info

No other issues found (see more)

🧪 All tests passed
❄️ No new flaky tests detected

🎯 Code Coverage (details)
Patch Coverage: 100.00%
Overall Coverage: 90.00% (+0.01%)

Commit SHA: 9ee6be1

@pr-commenter

pr-commenter Bot commented Jun 22, 2026

Benchmarks

Benchmark execution time: 2026-06-22 14:23:10

Comparing candidate commit 9ee6be1 in PR branch ivoanjo/fix-leftover-state-between-profilers with baseline commit d171f20 in branch master.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 48 metrics, 1 unstable metrics.

Explanation

This is an A/B test comparing a candidate commit's performance against that of a baseline commit. Performance changes are noted in the tables below as:

  • 🟩 = significantly better candidate vs. baseline
  • 🟥 = significantly worse candidate vs. baseline

We compute a confidence interval (CI) over the relative difference of means between metrics from the candidate and baseline commits, considering the baseline as the reference.

If the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD), the change is considered significant.

Feel free to reach out to #apm-benchmarking-platform on Slack if you have any questions.

More details about the CI and significant changes

You can imagine this CI as a range of values that is likely to contain the true difference of means between the candidate and baseline commits.

CIs of the difference of means are often centered around 0%, because often changes are not that big:

---------------------------------(------|---^--------)-------------------------------->
                              -0.6%    0%  0.3%     +1.2%
                                 |          |        |
         lower bound of the CI --'          |        |
sample mean (center of the CI) -------------'        |
         upper bound of the CI ----------------------'

As described above, a change is considered significant if the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD).

For instance, for an execution time metric, this confidence interval indicates a significantly worse performance:

----------------------------------------|---------|---(---------^---------)---------->
                                       0%        1%  1.3%      2.2%      3.1%
                                                  |   |         |         |
       significant impact threshold --------------'   |         |         |
                      lower bound of CI --------------'         |         |
       sample mean (center of the CI) --------------------------'         |
                      upper bound of CI ----------------------------------'

@eregon

eregon commented Jun 22, 2026

Member

To me this is clearly a bug in the test:

      it "does not sample the thread" do
        sample_after_gvl_running(t1)

        expect(samples).to be_empty
      end

It's incorrect to assert that samples is empty here: there can always be background threads, like the Timeout stdlib one we see in the failure.
So the test should only assert on samples for t1 and nothing else.

BTW we already do testing_threads_and_current.each { |t| clear_per_thread_context_for(t) } in the spec, so I don't think this would solve it, e.g. if the Timeout thread was somehow started after the new ThreadContext is created.

In general I think resetting the state is a mistake, in production we cannot/shouldn't do it (it's quite unsafe, it's too difficult to stop the previous ThreadContext/Worker and ensure no postponed job/signal/etc trigger) so our tests shouldn't either.
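A minimal sketch of the scoped assertion suggested in this comment. `Sample`, `samples_for_thread_id`, and the id used for `t1` are hypothetical. Only the label keys and the Timeout thread's values are taken from the failure output in the PR description, and the repository's own spec helpers may look different.

```ruby
# Hypothetical stand-in for the spec's sample objects; only the label keys
# (:state, :"thread id", :"thread name") come from the failure output.
Sample = Struct.new(:labels)

# Keeps only the samples whose :"thread id" label matches the given id.
def samples_for_thread_id(samples, thread_id)
  samples.select { |sample| sample.labels[:"thread id"] == thread_id }
end

# Made-up id for t1; the Timeout thread's id is the one from the failure output.
T1_THREAD_ID = "15000 (40000)"

samples = [
  Sample.new({state: "sleeping", "thread id": "14823 (39460)", "thread name": "Timeout stdlib thread"})
]

puts samples.empty?                                      # false: a global emptiness check fails
puts samples_for_thread_id(samples, T1_THREAD_ID).empty? # true: nothing was sampled for t1
```

In an RSpec example the second check would read something like `expect(samples_for_thread_id(samples, t1_id)).to be_empty`, which stays green regardless of which background threads happen to be sampled.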

@ivoanjo

ivoanjo commented Jun 22, 2026

Member Author

> In general I think resetting the state is a mistake, in production we cannot/shouldn't do it (it's quite unsafe, it's too difficult to stop the previous ThreadContext/Worker and ensure no postponed job/signal/etc trigger) so our tests shouldn't either.

See my note on the PR description -- if we don't reset the state, we'll be pushing very weird samples from old profilers.

@eregon

eregon commented Jun 22, 2026

Member

> See my note on the PR description -- if we don't reset the state, we'll be pushing very weird samples from old profilers.

Yes, I saw that, and I think it is no big deal. OTOH, concurrent access to the per-thread state would be much worse: inconsistent state and potentially segfaults (e.g. concurrent access to sampling_buffer would be pretty bad; this PR doesn't reset that one, but still).

@eregon

eregon commented Jun 22, 2026

Member

IOW, given that the per_thread_context objects are global, I think we should embrace it. Fighting it is not going to work.
If we wanted a new clean per_thread_context for every new ThreadContext, then we should have picked that approach, but as we saw in #5816, it does not work (it's too difficult to stop the previous ThreadContext/Worker).

hayat01sh1da pushed a commit to hayat01sh1da/dd-trace-rb that referenced this pull request Jun 22, 2026
**What does this PR do?**

This PR skips a flaky spec that hit
[master](https://github.com/DataDog/dd-trace-rb/actions/runs/27937925206/job/82664232085)
(the `#sample_after_gvl_running` failure quoted in full in the PR description above).

We started fixing it in DataDog#5926
but it looks like the fix might be more involved than initially
considered so let's unblock master while we figure out a good path.

**Motivation:**

Unblock master while we figure out a better solution

**Additional Notes:**

N/A

**How to test the change?**

CI should be green, now that this test is skipped

ivoanjo (Member, Author) commented:

OK, so I was missing a final sleep 1 in my example in the PR description.

With

Datadog.configure { |c| c.profiling.enabled = true }
Datadog::Profiling.wait_until_running
sleep 5
Datadog.configure { |c| c.profiling.enabled = false }
sleep 5
Datadog.configure { |c| c.profiling.enabled = true } # <-- profiler #2
sleep 1

I got what I feared... a 1-second profile with 6 seconds of data (representing the gap while the profiler was stopped)

(Image attached to the original comment.)

ivoanjo (Member, Author) commented:

For the original flakiness, I could "reproduce it" with

  describe "#sample_after_gvl_running" do
    before { skip_if_gvl_profiling_not_supported(self) }

    fcontext "if thread does not have per-thread context" do
      before { remove_per_thread_context_for(t1) }

      # it do
      #   expect(sample_after_gvl_running(t1)).to be false
      # end

      it 'weird test setup' do
        sleep_thread = Thread.new { sleep }
        sample # Creates context
        on_gvl_released(sleep_thread)
        sample # Waiting starts
        sample # Sets skip
        pp per_thread_context[sleep_thread]
      end

      it "does not sample the thread" do
        # skip("This is flaky -- we're discussing a full fix in https://github.com/DataDog/dd-trace-rb/pull/5926 but for now let's skip")

        sample_after_gvl_running(t1)

        expect(samples).to be_empty
      end
    end
  end

I ran this with --order=defined to make sure the specs ran in this order. There is one very subtle detail here: was_skipped_at_last_sample gets reset when the recorder is flushed, so the particular issue that made the spec flaky might not happen in production, since we flush the recorder when we stop the profiler.

2 participants