Add regression test for take_event/on_sample_lost lock-order inversion#278
Open
thomasmoore-torc wants to merge 3 commits into
Open
Add regression test for take_event/on_sample_lost lock-order inversion#278thomasmoore-torc wants to merge 3 commits into
thomasmoore-torc wants to merge 3 commits into
Conversation
6a91217 to
8d105bd
Compare
1edcf70 to
6789bfe
Compare
Adds test_event_message_lost_deadlock, which reproduces an AB-BA
lock-order inversion in the rmw_fastrtps subscription QoS-event path:
* Executor side: rmw_take_event(MESSAGE_LOST) locks the rmw event
mutex, then queries the reader's sample-lost status, which locks the
DataReader mutex. (E -> R)
* DDS receive side: the reader holds its mutex while delivering
on_sample_lost, which calls back into rmw and locks the event
mutex. (R -> E)
Run concurrently under real SAMPLE_LOST these orderings deadlock. The
test arms the MESSAGE_LOST callback (the events-executor path), floods a
depth-1 best-effort subscription to force SAMPLE_LOST, and runs a
rmw_take_event loop concurrently; a watchdog reports the deadlock if the
take loop stalls. An events_seen > 0 guard rejects a misleading pass
when no loss was generated.
Requires intra-process delivery to be disabled, so the test is
registered with a Fast DDS profile (no_intraprocess_profile.xml) that
sets intraprocess_delivery OFF; intra-process delivery hands samples
over inline and never loses them. Registered only for the fastrtps
variants.
This is the event-path analog of the already-fixed data-path inversion
(ros2/rmw_fastrtps#657). The fix lives in ros2/rmw_fastrtps
(take_event / set_on_new_event_callback no longer hold the event mutex
across the reader status query): unpatched rolling deadlocks (test
fails / times out), the fixed branch passes.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Thomas Moore <thomas.moore@torc.ai>
…ised The test was registered for every rmw implementation via call_for_each_rmw_implementation(test_api). On non-fastrtps stacks FASTRTPS_DEFAULT_PROFILES_FILE is a no-op and a depth-1 best-effort reader that never takes does not reliably produce SAMPLE_LOST, so events_seen stayed 0 and the final EXPECT_GT(events_seen, 0u) failed for a reason unrelated to any deadlock. Guard the registration with if(rmw_implementation MATCHES "fastrtps") so it only runs for the fastrtps variants, and change the events_seen == 0 guard from EXPECT_GT to GTEST_SKIP so an un-exercised scenario is reported as skipped rather than a misleading pass or a spurious failure. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Thomas Moore <thomas.moore@torc.ai>
6789bfe to
c3be8d7
Compare
Collaborator
|
Pulls: ros2/rmw_fastrtps#890, #278 |
MiguelCompany
suggested changes
Jun 19, 2026
MiguelCompany
left a comment
Contributor
There was a problem hiding this comment.
Apart from the reported uncrustify errors, this LGTM
Signed-off-by: Thomas Moore <thomas.moore@torc.ai>
|
Tick the box to add this pull request to the merge queue (same as
|
Collaborator
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Adds
test_event_message_lost_deadlock, which reproduces an AB-BA lock-order inversion in thermw_fastrtpssubscription QoS-event path:rmw_take_event(MESSAGE_LOST)locks the rmw event mutex, then queries the reader's sample-lost status, which locks the DataReader mutex. (E -> R)Run concurrently under real
SAMPLE_LOSTthese orderings deadlock. The test arms theMESSAGE_LOSTcallback (the events-executor path), floods a depth-1 best-effort subscription to forceSAMPLE_LOST, and runs armw_take_eventloop concurrently; a watchdog reports the deadlock if the take loop stalls. Anevents_seen > 0guard rejects a misleading pass when no loss was generated.Requires intra-process delivery to be disabled, so the test is registered with a Fast DDS profile (
no_intraprocess_profile.xml) that setsintraprocess_deliveryOFF; intra-process delivery hands samples over inline and never loses them. Registered only for the fastrtps variants.This is the event-path analog of the already-fixed data-path inversion (ros2/rmw_fastrtps#657). The fix lives in
ros2/rmw_fastrtps(take_event/set_on_new_event_callbackno longer hold the event mutex across the reader status query): unpatched rolling deadlocks (test fails / times out), the fixed branch passes.This test is expected to fail until ros2/rmw_fastrtps#890 is merged.
Is this user-facing behavior change?
Did you use Generative AI?
This test was generated using Claude Opus 4.8 (1M context).
Additional Information
Here's the relevant output of running the test with TSAN enabled without the fix incorporated: