[NO-TICKET] Profiling: Reset leftover per-thread state when creating new ThreadContext#5926
[NO-TICKET] Profiling: Reset leftover per-thread state when creating new ThreadContext#5926ivoanjo wants to merge 1 commit into
Conversation
…new ThreadContext **What does this PR do?** Since #5816, we've been keeping the profiler's per-thread context directly attached to each Ruby thread. This PR fixes a flaky spec that hit [master](https://github.com/DataDog/dd-trace-rb/actions/runs/27937925206/job/82664232085): ``` Failures: 1) Datadog::Profiling::Collectors::ThreadContext#sample_after_gvl_running if thread has not been sampled before does not sample the thread Failure/Error: expect(samples).to be_empty expected `[#<struct ProfileHelpers::Sample locations=[#<struct ProfileHelpers::Frame base_label="sleep", path="...26761, :state=>"sleeping", :"thread id"=>"14823 (39460)", :"thread name"=>"Timeout stdlib thread"}>].empty?` to be truthy, got false # ./spec/datadog/profiling/collectors/thread_context_spec.rb:1934:in `block (4 levels) in <top (required)>' # ./spec/spec_helper.rb:327:in `block (2 levels) in <top (required)>' # ./spec/spec_helper.rb:207:in `block (2 levels) in <top (required)>' # /usr/local/bundle/gems/webmock-3.26.2/lib/webmock/rspec.rb:39:in `block (2 levels) in <top (required)>' # /usr/local/bundle/gems/rspec-wait-0.0.10/lib/rspec/wait.rb:47:in `block (2 levels) in <top (required)>' # ./spec/support/execute_in_fork.rb:32:in `run' Finished in 27.48 seconds (files took 1.09 seconds to load) 806 examples, 1 failure, 15 pending Failed examples: rspec ./spec/datadog/profiling/collectors/thread_context_spec.rb:1931 # Datadog::Profiling::Collectors::ThreadContext#sample_after_gvl_running if thread has not been sampled before does not sample the thread ``` This spec failed because we didn't account for the background timeout thread in our assertions, and thus we saw an extra thread that failed the assertion. Specifically, what I believe happened is that the background thread got a `was_skipped_at_last_sample` in a previous test, and that carried over to this test that assumed only one thread (`t1`) was in such a pending state. By resetting the state for a new instance of the `ThreadContext`, this issue will no longer happen. **Motivation:** Fix flaky test and what I argue is also a logic issue (see below). **Additional Notes:** When looking into this one, it dawned on me that there was a deeper problem being exposed here, which is why the solution to this problem is inside the production code in `collectors_thread_context.c` and not in the `thread_context_spec.rb`. Specifically, all of `cpu_time_at_previous_sample_ns`, `wall_time_at_previous_sample_ns`, `gvl_waiting_at`, `gvl_state_change_count`, `gvl_state_change_count_at_previous_sample`, `was_skipped_at_last_sample`, `gc_tracking.cpu_time_at_start_ns`, `gc_tracking.wall_time_at_start_ns` carry their state across profiler restarts. So a similar problem to the one that happened in our specs could in fact happen in production, where state from a previous profiler is used in sampling decisions on a new profiler. This is a bit of a corner case but consider something like: ```ruby Datadog.configure { |c| c.profiling.enabled = true } Datadog::Profiling.wait_until_running sleep 5 Datadog.configure { |c| c.profiling.enabled = false } sleep 5 Datadog.configure { |c| c.profiling.enabled = true } # <-- profiler #2 ``` I didn't try it, but I expect that profiler #2 would suddenly start using `wall_time_at_previous_sample_ns` on existing threads and assign one single sample with all of the time while the profiler was stopped; this was what effectively was happening in the test, but at a much smaller scale. This solution fixes this by forcing the new profiler to start with a clean state. **How to test the change?** Existing coverage + the new test should be enough for this one.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9ee6be197f
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| for (long i = 0; i < thread_count; i++) { | ||
| VALUE thread = RARRAY_AREF(threads, i); | ||
| per_thread_context *thread_context = get_per_thread_context(thread); | ||
| if (thread_context != NULL) reset_context_state(thread_context); |
There was a problem hiding this comment.
Defer resetting shared thread state until old profiler stops
When profiling is reconfigured while still enabled, Datadog::Core::Configuration.replace_components! constructs the new Components before calling old.shutdown! (lib/datadog/core/configuration.rb:264-277), so this reset runs while the previous profiler worker/scheduler can still be sampling or serializing the same global per_thread_context objects. In that window, clearing timestamps, GVL state, and was_skipped_at_last_sample can make the old profiler lose accumulated wall/CPU time or emit an inconsistent final profile; reset this shared state only after the old profiler is stopped, or during the new profiler start path after shutdown.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Hmmm this is a good point, let me look into it
|
BenchmarksBenchmark execution time: 2026-06-22 14:23:10 Comparing candidate commit 9ee6be1 in PR branch Found 0 performance improvements and 0 performance regressions! Performance is the same for 48 metrics, 1 unstable metrics.
|
|
To me this is clearly a bug of the test: it "does not sample the thread" do
sample_after_gvl_running(t1)
expect(samples).to be_empty
endIt's incorrect to assert BTW we already do In general I think resetting the state is a mistake, in production we cannot/shouldn't do it (it's quite unsafe, it's too difficult to stop the previous ThreadContext/Worker and ensure no postponed job/signal/etc trigger) so our tests shouldn't either. |
See my note on the PR description -- if we don't reset the state, we'll be pushing very weird samples from old profilers. |
Yes, I saw that and I think this is no big deal. OTOH inconsistent state because of concurrent access to the per-thread state would be much worse: inconsistent state and potentially segfaults (e.g. concurrent access to |
|
IOW given the |
**What does this PR do?** This PR skips a flaky spec that hit [master](https://github.com/DataDog/dd-trace-rb/actions/runs/27937925206/job/82664232085): ``` Failures: 1) Datadog::Profiling::Collectors::ThreadContext#sample_after_gvl_running if thread has not been sampled before does not sample the thread Failure/Error: expect(samples).to be_empty expected `[#<struct ProfileHelpers::Sample locations=[#<struct ProfileHelpers::Frame base_label="sleep", path="...26761, :state=>"sleeping", :"thread id"=>"14823 (39460)", :"thread name"=>"Timeout stdlib thread"}>].empty?` to be truthy, got false # ./spec/datadog/profiling/collectors/thread_context_spec.rb:1934:in `block (4 levels) in <top (required)>' # ./spec/spec_helper.rb:327:in `block (2 levels) in <top (required)>' # ./spec/spec_helper.rb:207:in `block (2 levels) in <top (required)>' # /usr/local/bundle/gems/webmock-3.26.2/lib/webmock/rspec.rb:39:in `block (2 levels) in <top (required)>' # /usr/local/bundle/gems/rspec-wait-0.0.10/lib/rspec/wait.rb:47:in `block (2 levels) in <top (required)>' # ./spec/support/execute_in_fork.rb:32:in `run' Finished in 27.48 seconds (files took 1.09 seconds to load) 806 examples, 1 failure, 15 pending Failed examples: rspec ./spec/datadog/profiling/collectors/thread_context_spec.rb:1931 # Datadog::Profiling::Collectors::ThreadContext#sample_after_gvl_running if thread has not been sampled before does not sample the thread ``` We started fixing it in DataDog#5926 but it looks like the fix might be more involved than initially considered so let's unblock master while we figure out a good path. **Motivation:** Unblock master while we figure out a better solution **Additional Notes:** N/A **How to test the change?** CI should be green, now that this test is skipped
There was a problem hiding this comment.
Ok so I was missing a final sleep 1 on my example in the PR description.
With
Datadog.configure { |c| c.profiling.enabled = true }
Datadog::Profiling.wait_until_running
sleep 5
Datadog.configure { |c| c.profiling.enabled = false }
sleep 5
Datadog.configure { |c| c.profiling.enabled = true } # <-- profiler #2
sleep 1I got what I feared... a 1-second profile with 6 seconds of data (representing the gap while the profiler was stopped)
There was a problem hiding this comment.
For the original flakiness, I could "reproduce it" with
describe "#sample_after_gvl_running" do
before { skip_if_gvl_profiling_not_supported(self) }
fcontext "if thread does not have per-thread context" do
before { remove_per_thread_context_for(t1) }
# it do
# expect(sample_after_gvl_running(t1)).to be false
# end
it 'weird test setup' do
sleep_thread = Thread.new { sleep }
sample # Creates context
on_gvl_released(sleep_thread)
sample # Waiting starts
sample # Sets skip
pp per_thread_context[sleep_thread]
end
it "does not sample the thread" do
# skip("This is flaky -- we're discussing a full fix in https://github.com/DataDog/dd-trace-rb/pull/5926 but for now let's skip")
sample_after_gvl_running(t1)
expect(samples).to be_empty
end
endAnd running with --order=defined to make sure the specs ran in this order. There is one very subtle detail here -- the was_skipped_at_last_sample gets reset when the recorder is flushed so the issue in particular that made the spec flaky might not happen in practice in production since when we stop the profiler we flush the recorder.
What does this PR do?
Since #5816, we've been keeping the profiler's per-thread context directly attached to each Ruby thread.
This PR fixes a flaky spec that hit
master:
This spec failed because we didn't account for the background timeout thread in our assertions, and thus we saw an extra thread that failed the assertion.
Specifically, what I believe happened is that the background thread got a
was_skipped_at_last_samplein a previous test, and that carried over to this test that assumed only one thread (t1) was in such a pending state.By resetting the state for a new instance of the
ThreadContext, this issue will no longer happen.Motivation:
Fix flaky test and what I argue is also a logic issue (see below).
Change log entry
None. (Specifically -- the behavior being fixed was introduced in #5816 which was not shipped to customers yet, so there's no actual changelog from a fix)
Additional Notes:
When looking into this one, it dawned on me that there was a deeper problem being exposed here, which is why the solution to this problem is inside the production code in
collectors_thread_context.cand not in thethread_context_spec.rb.Specifically, all of
cpu_time_at_previous_sample_ns,wall_time_at_previous_sample_ns,gvl_waiting_at,gvl_state_change_count,gvl_state_change_count_at_previous_sample,was_skipped_at_last_sample,gc_tracking.cpu_time_at_start_ns,gc_tracking.wall_time_at_start_nscarry their state across profiler restarts.So a similar problem to the one that happened in our specs could in fact happen in production, where state from a previous profiler is used in sampling decisions on a new profiler. This is a bit of a corner case but consider something like:
I didn't try it, but I expect that profiler
#2would suddenly start usingwall_time_at_previous_sample_nson existing threads and assign one single sample with all of the time while the profiler was stopped; this was what effectively was happening in the test, but at a much smaller scale.This solution fixes this by forcing the new profiler to start with a clean state.
How to test the change?
Existing coverage + the new test should be enough for this one.