Add regression test for take_event/on_sample_lost lock-order inversion #278

Open
thomasmoore-torc wants to merge 3 commits into ros2:rolling from thomasmoore-torc:take_event_deadlock

@thomasmoore-torc

Description

Adds test_event_message_lost_deadlock, which reproduces an AB-BA lock-order inversion in the rmw_fastrtps subscription QoS-event path:

  • Executor side: rmw_take_event(MESSAGE_LOST) locks the rmw event mutex, then queries the reader's sample-lost status, which locks the DataReader mutex. (E -> R)
  • DDS receive side: the reader holds its mutex while delivering on_sample_lost, which calls back into rmw and locks the event mutex. (R -> E)
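
The two orderings above can be made concrete with a small standalone sketch. It does not use rmw or Fast DDS; `event_mutex` and `reader_mutex` are hypothetical stand-ins for the rmw event mutex (E) and the DataReader mutex (R), and `try_lock` replaces the blocking `lock` so the program reports the inversion instead of hanging.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

// Outcome of each side's attempt to take its second lock.
struct InversionResult
{
  bool executor_got_reader;  // held E, tried R
  bool receiver_got_event;   // held R, tried E
};

InversionResult demonstrate_ab_ba()
{
  std::mutex event_mutex;       // stand-in for the rmw event mutex (E)
  std::mutex reader_mutex;      // stand-in for the DataReader mutex (R)
  std::atomic<int> holding{0};  // threads currently holding their first lock
  std::atomic<int> tried{0};    // threads that have attempted their second lock
  InversionResult result{true, true};

  auto wait_until = [](const std::atomic<int> & counter, int target) {
      while (counter.load() < target) {
        std::this_thread::yield();
      }
    };

  // Executor side: E -> R
  std::thread executor([&] {
      std::lock_guard<std::mutex> hold_e(event_mutex);
      ++holding;
      wait_until(holding, 2);  // both first locks are now held
      result.executor_got_reader = reader_mutex.try_lock();
      if (result.executor_got_reader) {reader_mutex.unlock();}
      ++tried;
      wait_until(tried, 2);    // keep E held until the other side has tried
    });

  // DDS receive side: R -> E
  std::thread receiver([&] {
      std::lock_guard<std::mutex> hold_r(reader_mutex);
      ++holding;
      wait_until(holding, 2);
      result.receiver_got_event = event_mutex.try_lock();
      if (result.receiver_got_event) {event_mutex.unlock();}
      ++tried;
      wait_until(tried, 2);    // keep R held until the other side has tried
    });

  executor.join();
  receiver.join();
  assert(tried.load() == 2);
  return result;
}
```

Both `try_lock` calls fail because each thread still holds the lock the other one wants. With blocking `lock()` calls in their place, neither thread could ever proceed.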

When both run concurrently under real SAMPLE_LOST conditions, these orderings deadlock. The test arms the MESSAGE_LOST callback (the events-executor path), floods a depth-1 best-effort subscription to force SAMPLE_LOST, and runs an rmw_take_event loop concurrently; a watchdog reports the deadlock if the take loop stalls. An events_seen > 0 guard rejects a misleading pass when no loss was generated.
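
As an illustration of the watchdog idea only (this is not the code from the test), the sketch below runs a loop on a worker thread and reports a stall if its progress counter stops advancing. The names `WatchdogState` and `loop_completes_without_stall` are hypothetical, and `step` stands in for one rmw_take_event call.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <memory>
#include <thread>
#include <utility>

// Shared between worker and watchdog; kept alive by shared_ptr so a
// stalled, detached worker never touches freed memory.
struct WatchdogState
{
  std::atomic<std::size_t> progress{0};
  std::atomic<bool> done{false};
};

// Returns true if all iterations finished, false if the loop made no
// progress for longer than stall_limit.
bool loop_completes_without_stall(
  std::function<void()> step,
  std::size_t iterations,
  std::chrono::milliseconds stall_limit)
{
  auto state = std::make_shared<WatchdogState>();
  std::thread worker([state, step = std::move(step), iterations] {
      for (std::size_t i = 0; i < iterations; ++i) {
        step();
        ++state->progress;
      }
      state->done = true;
    });

  std::size_t last_seen = 0;
  auto last_change = std::chrono::steady_clock::now();
  while (!state->done.load()) {
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
    const std::size_t seen = state->progress.load();
    const auto now = std::chrono::steady_clock::now();
    if (seen != last_seen) {
      last_seen = seen;
      last_change = now;
    } else if (now - last_change > stall_limit) {
      worker.detach();  // a deadlocked thread can never be joined
      return false;
    }
  }
  worker.join();
  return true;
}
```

Detaching on a stall is a deliberate choice in this sketch: joining a thread that is blocked on a mutex would hang the watchdog as well, turning a reportable failure into a test timeout.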

The test requires intra-process delivery to be disabled, so it is registered with a Fast DDS profile (no_intraprocess_profile.xml) that sets intraprocess_delivery to OFF; intra-process delivery hands samples over inline and never loses them. The test is registered only for the fastrtps variants.

This is the event-path analog of the already-fixed data-path inversion (ros2/rmw_fastrtps#657). The fix lives in ros2/rmw_fastrtps (take_event / set_on_new_event_callback no longer hold the event mutex across the reader status query): on unpatched rolling the test deadlocks (fails or times out), while on the fixed branch it passes.
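
The shape of such a fix can be sketched without rmw or Fast DDS. The code below is an assumption about the general pattern, not the actual rmw_fastrtps change: `FakeReader`, `EventState`, and `take_message_lost` are hypothetical names, and `on_query` is a test hook that runs while the reader mutex is held.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <mutex>

// Hypothetical stand-in for a DataReader whose status query locks R.
struct FakeReader
{
  std::mutex mutex;
  std::uint32_t total_lost = 0;
  std::function<void()> on_query;  // hook invoked while R is held

  std::uint32_t get_sample_lost_status()
  {
    std::lock_guard<std::mutex> lock(mutex);
    if (on_query) {on_query();}
    return total_lost;
  }
};

// Hypothetical stand-in for the rmw-side event state guarded by E.
struct EventState
{
  std::mutex mutex;
  std::uint32_t last_reported = 0;
};

// R is taken and released inside the query before E is taken, so the
// two mutexes are never held at the same time on this path.
std::uint32_t take_message_lost(FakeReader & reader, EventState & event)
{
  const std::uint32_t total = reader.get_sample_lost_status();  // R only
  std::lock_guard<std::mutex> lock(event.mutex);                // E only
  assert(total >= event.last_reported);
  const std::uint32_t delta = total - event.last_reported;
  event.last_reported = total;
  return delta;
}
```

Because the query result is captured before E is locked, a listener callback that arrives in between can still update the event state; a real implementation has to decide how to reconcile that, which this sketch does not attempt.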

This test is expected to fail until ros2/rmw_fastrtps#890 is merged.

Is this a user-facing behavior change?

Did you use Generative AI?

This test was generated using Claude Opus 4.8 (1M context).

Additional Information

Here's the relevant output of running the test with TSAN enabled without the fix incorporated:

WARNING: ThreadSanitizer: lock-order-inversion (potential deadlock) (pid=1759)
  Cycle in lock order graph: M0 (0x725c00002868) => M1 (0x725800003d88) => M0

  Mutex M1 acquired here while holding mutex M0 in main thread:
    #0 pthread_mutex_lock ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:1341 (libtsan.so.2+0x59a13) (BuildId: 2a13a7710e361d06f7babbea53065ca2be93f738)
    #1 eprosima::fastdds::dds::DataReaderImpl::get_sample_lost_status(eprosima::fastdds::dds::BaseStatus&) <null> (libfastdds.so.3.6+0x2f5e8f) (BuildId: 4876362f9e32f84c8de4cabe21943a7e824e50df)
    #2 rmw_fastrtps_shared_cpp::__rmw_event_set_callback(rmw_event_s*, void (*)(void const*, unsigned long), void const*) /root/ws/src/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_event.cpp:160 (librmw_fastrtps_shared_cpp.so+0x7ea51) (BuildId: bf3e92c2fa8c596c57cb845fbc19a000efc9cd77)
    #3 rmw_event_set_callback /root/ws/src/rmw_fastrtps/rmw_fastrtps_cpp/src/rmw_event.cpp:59 (librmw_fastrtps_cpp.so+0x5c605) (BuildId: 9b3ff1ce4ed3e15ab82d0d4484b7111af698f51f)
    #4 TestEventMessageLostDeadlock_take_event_does_not_deadlock_with_on_sample_lost_Test::TestBody() /root/ws/src/rmw_implementation/test_rmw_implementation/test/test_event_message_lost_deadlock.cpp:135 (test_event_message_lost_deadlock+0x11c76) (BuildId: 585a1db1b0a675f95e137685aac2f5668e8228b2)
    #5 void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) <null> (test_event_message_lost_deadlock+0x50dae) (BuildId: 585a1db1b0a675f95e137685aac2f5668e8228b2)

  Mutex M0 previously acquired by the same thread here:
    #0 pthread_mutex_lock ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:1341 (libtsan.so.2+0x59a13) (BuildId: 2a13a7710e361d06f7babbea53065ca2be93f738)
    #1 __gthread_mutex_lock /usr/include/x86_64-linux-gnu/c++/13/bits/gthr-default.h:749 (librmw_fastrtps_shared_cpp.so+0x4db11) (BuildId: bf3e92c2fa8c596c57cb845fbc19a000efc9cd77)
    #2 std::mutex::lock() /usr/include/c++/13/bits/std_mutex.h:113 (librmw_fastrtps_shared_cpp.so+0x4db11)
    #3 std::unique_lock<std::mutex>::lock() /usr/include/c++/13/bits/unique_lock.h:141 (librmw_fastrtps_shared_cpp.so+0x4db11)
    #4 std::unique_lock<std::mutex>::unique_lock(std::mutex&) /usr/include/c++/13/bits/unique_lock.h:71 (librmw_fastrtps_shared_cpp.so+0x4db11)
    #5 rcpputils::unique_lock<std::mutex>::unique_lock(std::mutex&) /opt/ros/rolling/include/rcpputils/rcpputils/unique_lock.hpp:35 (librmw_fastrtps_shared_cpp.so+0x4db11)
    #6 RMWSubscriptionEvent::set_on_new_event_callback(rmw_event_type_e, void const*, void (*)(void const*, unsigned long)) /root/ws/src/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/custom_subscriber_info.cpp:242 (librmw_fastrtps_shared_cpp.so+0x4db11)
    #7 rmw_fastrtps_shared_cpp::__rmw_event_set_callback(rmw_event_s*, void (*)(void const*, unsigned long), void const*) /root/ws/src/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_event.cpp:160 (librmw_fastrtps_shared_cpp.so+0x7ea51) (BuildId: bf3e92c2fa8c596c57cb845fbc19a000efc9cd77)
    #8 rmw_event_set_callback /root/ws/src/rmw_fastrtps/rmw_fastrtps_cpp/src/rmw_event.cpp:59 (librmw_fastrtps_cpp.so+0x5c605) (BuildId: 9b3ff1ce4ed3e15ab82d0d4484b7111af698f51f)
    #9 TestEventMessageLostDeadlock_take_event_does_not_deadlock_with_on_sample_lost_Test::TestBody() /root/ws/src/rmw_implementation/test_rmw_implementation/test/test_event_message_lost_deadlock.cpp:135 (test_event_message_lost_deadlock+0x11c76) (BuildId: 585a1db1b0a675f95e137685aac2f5668e8228b2)
    #10 void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) <null> (test_event_message_lost_deadlock+0x50dae) (BuildId: 585a1db1b0a675f95e137685aac2f5668e8228b2)

  Mutex M0 acquired here while holding mutex M1 in thread T4:
    #0 pthread_mutex_lock ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:1341 (libtsan.so.2+0x59a13) (BuildId: 2a13a7710e361d06f7babbea53065ca2be93f738)
    #1 __gthread_mutex_lock /usr/include/x86_64-linux-gnu/c++/13/bits/gthr-default.h:749 (librmw_fastrtps_shared_cpp.so+0x4c95f) (BuildId: bf3e92c2fa8c596c57cb845fbc19a000efc9cd77)
    #2 std::mutex::lock() /usr/include/c++/13/bits/std_mutex.h:113 (librmw_fastrtps_shared_cpp.so+0x4c95f)
    #3 std::lock_guard<std::mutex>::lock_guard(std::mutex&) /usr/include/c++/13/bits/std_mutex.h:249 (librmw_fastrtps_shared_cpp.so+0x4c95f)
    #4 RMWSubscriptionEvent::update_sample_lost(unsigned int, unsigned int) /root/ws/src/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/custom_subscriber_info.cpp:449 (librmw_fastrtps_shared_cpp.so+0x4c95f)
    #5 CustomDataReaderListener::on_sample_lost(eprosima::fastdds::dds::DataReader*, eprosima::fastdds::dds::BaseStatus const&) /root/ws/src/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/custom_subscriber_info.cpp:97 (librmw_fastrtps_shared_cpp.so+0x4ca40) (BuildId: bf3e92c2fa8c596c57cb845fbc19a000efc9cd77)
    #6 eprosima::fastdds::dds::DataReaderImpl::InnerDataReaderListener::on_sample_lost(eprosima::fastdds::rtps::RTPSReader*, int) <null> (libfastdds.so.3.6+0x2f84de) (BuildId: 4876362f9e32f84c8de4cabe21943a7e824e50df)

  Mutex M1 previously acquired by the same thread here:
    #0 pthread_mutex_lock ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:1341 (libtsan.so.2+0x59a13) (BuildId: 2a13a7710e361d06f7babbea53065ca2be93f738)
    #1 eprosima::fastdds::rtps::StatelessReader::process_data_msg(eprosima::fastdds::rtps::CacheChange_t*) <null> (libfastdds.so.3.6+0x5867d6) (BuildId: 4876362f9e32f84c8de4cabe21943a7e824e50df)

  Thread T4 'dds.shm.7411' (tid=1769, running) created by main thread at:
    #0 pthread_create ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:1022 (libtsan.so.2+0x5ac1a) (BuildId: 2a13a7710e361d06f7babbea53065ca2be93f738)
    #1 eprosima::thread::start_thread_impl(int, void* (*)(void*), void*) <null> (libfastdds.so.3.6+0x60633a) (BuildId: 4876362f9e32f84c8de4cabe21943a7e824e50df)
    #2 init_context_impl /root/ws/src/rmw_fastrtps/rmw_fastrtps_cpp/src/init_rmw_context_impl.cpp:82 (librmw_fastrtps_cpp.so+0x38090) (BuildId: 9b3ff1ce4ed3e15ab82d0d4484b7111af698f51f)
    #3 rmw_fastrtps_cpp::increment_context_impl_ref_count(rmw_context_s*) /root/ws/src/rmw_fastrtps/rmw_fastrtps_cpp/src/init_rmw_context_impl.cpp:244 (librmw_fastrtps_cpp.so+0x39597) (BuildId: 9b3ff1ce4ed3e15ab82d0d4484b7111af698f51f)
    #4 rmw_create_node /root/ws/src/rmw_fastrtps/rmw_fastrtps_cpp/src/rmw_node.cpp:60 (librmw_fastrtps_cpp.so+0x5e70f) (BuildId: 9b3ff1ce4ed3e15ab82d0d4484b7111af698f51f)
    #5 TestEventMessageLostDeadlock::SetUp() /root/ws/src/rmw_implementation/test_rmw_implementation/test/test_event_message_lost_deadlock.cpp:84 (test_event_message_lost_deadlock+0x1713a) (BuildId: 585a1db1b0a675f95e137685aac2f5668e8228b2)
    #6 void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) <null> (test_event_message_lost_deadlock+0x50dae) (BuildId: 585a1db1b0a675f95e137685aac2f5668e8228b2)

Comment thread test_rmw_implementation/CMakeLists.txt
@thomasmoore-torc thomasmoore-torc force-pushed the take_event_deadlock branch 2 times, most recently from 1edcf70 to 6789bfe Compare June 18, 2026 05:50
thomasmoore-torc and others added 2 commits June 18, 2026 05:54
Adds test_event_message_lost_deadlock, which reproduces an AB-BA
lock-order inversion in the rmw_fastrtps subscription QoS-event path:

  * Executor side: rmw_take_event(MESSAGE_LOST) locks the rmw event
    mutex, then queries the reader's sample-lost status, which locks the
    DataReader mutex.                                          (E -> R)
  * DDS receive side: the reader holds its mutex while delivering
    on_sample_lost, which calls back into rmw and locks the event
    mutex.                                                     (R -> E)

Run concurrently under real SAMPLE_LOST these orderings deadlock. The
test arms the MESSAGE_LOST callback (the events-executor path), floods a
depth-1 best-effort subscription to force SAMPLE_LOST, and runs a
rmw_take_event loop concurrently; a watchdog reports the deadlock if the
take loop stalls. An events_seen > 0 guard rejects a misleading pass
when no loss was generated.

Requires intra-process delivery to be disabled, so the test is
registered with a Fast DDS profile (no_intraprocess_profile.xml) that
sets intraprocess_delivery OFF; intra-process delivery hands samples
over inline and never loses them. Registered only for the fastrtps
variants.

This is the event-path analog of the already-fixed data-path inversion
(ros2/rmw_fastrtps#657). The fix lives in ros2/rmw_fastrtps
(take_event / set_on_new_event_callback no longer hold the event mutex
across the reader status query): unpatched rolling deadlocks (test
fails / times out), the fixed branch passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Thomas Moore <thomas.moore@torc.ai>
…ised

The test was registered for every rmw implementation via
call_for_each_rmw_implementation(test_api). On non-fastrtps stacks
FASTRTPS_DEFAULT_PROFILES_FILE is a no-op and a depth-1 best-effort
reader that never takes does not reliably produce SAMPLE_LOST, so
events_seen stayed 0 and the final EXPECT_GT(events_seen, 0u) failed for
a reason unrelated to any deadlock.

Guard the registration with if(rmw_implementation MATCHES "fastrtps") so
it only runs for the fastrtps variants, and change the events_seen == 0
guard from EXPECT_GT to GTEST_SKIP so an un-exercised scenario is
reported as skipped rather than a misleading pass or a spurious failure.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Thomas Moore <thomas.moore@torc.ai>
@fujitatomoya

Collaborator

Pulls: ros2/rmw_fastrtps#890, #278
Gist: https://gist.githubusercontent.com/fujitatomoya/e30a6f121dec854fa58db56e23dd4ff1/raw/e44ebf55fe7729cb041494bb0a3b70aba1d9acab/ros2.repos
BUILD args: --packages-above-and-dependencies rmw_fastrtps_shared_cpp test_rmw_implementation
TEST args: --packages-above rmw_fastrtps_shared_cpp test_rmw_implementation
ROS Distro: rolling
Job: ci_launcher
ci_launcher ran: https://ci.ros2.org/job/ci_launcher/19606

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

@MiguelCompany MiguelCompany left a comment

Contributor

Apart from the reported uncrustify errors, this LGTM

Signed-off-by: Thomas Moore <thomas.moore@torc.ai>
@mergify

mergify Bot commented Jun 19, 2026

Tick the box to add this pull request to the merge queue (same as @mergifyio queue).

  • Queue this pull request

@fujitatomoya

Collaborator
  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status
