@rueian rueian commented Jun 26, 2025

Why are these changes needed?

The TSAN check for NodeManagerTest has been flaky since #54097. See the run history below, where the first attempt failed.

[Image: test run history]

NodeManagerTest can still occasionally fail the TSAN check on the following code path, which differs from the one mentioned in #54097 (comment):

[2025-06-26T00:54:16Z] ==================
[2025-06-26T00:54:16Z] WARNING: ThreadSanitizer: data race (pid=1488)
[2025-06-26T00:54:16Z]   Read of size 8 at 0x7b18000035a8 by thread T29:
[2025-06-26T00:54:16Z]     #0 absl::lts_20230802::container_internal::GroupSse2Impl::GroupSse2Impl(absl::lts_20230802::container_internal::ctrl_t const*) /proc/self/cwd/external/com_google_absl/absl/container/internal/raw_hash_set.h:589:12 (liblibraylet_Ulib.so+0x214ce7)
[2025-06-26T00:54:16Z]     #1 absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<int, std::__1::deque<std::__1::shared_ptr<ray::raylet::internal::Work>, std::__1::allocator<std::__1::shared_ptr<ray::raylet::internal::Work> > > >, absl::lts_20230802::hash_internal::Hash<int>, std::__1::equal_to<int>, std::__1::allocator<std::__1::pair<int const, std::__1::deque<std::__1::shared_ptr<ray::raylet::internal::Work>, std::__1::allocator<std::__1::shared_ptr<ray::raylet::internal::Work> > > > > >::iterator::skip_empty_or_deleted() /proc/self/cwd/external/com_google_absl/absl/container/internal/raw_hash_set.h:1648:26 (liblibraylet_Ulib.so+0x214ce7)
[2025-06-26T00:54:16Z]     #2 absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<int, std::__1::deque<std::__1::shared_ptr<ray::raylet::internal::Work>, std::__1::allocator<std::__1::shared_ptr<ray::raylet::internal::Work> > > >, absl::lts_20230802::hash_internal::Hash<int>, std::__1::equal_to<int>, std::__1::allocator<std::__1::pair<int const, std::__1::deque<std::__1::shared_ptr<ray::raylet::internal::Work>, std::__1::allocator<std::__1::shared_ptr<ray::raylet::internal::Work> > > > > >::begin() /proc/self/cwd/external/com_google_absl/absl/container/internal/raw_hash_set.h:1898:8 (liblibraylet_Ulib.so+0x214ce7)
[2025-06-26T00:54:16Z]     #3 void ray::erase_if<int, std::__1::shared_ptr<ray::raylet::internal::Work> >(absl::lts_20230802::flat_hash_map<int, std::__1::deque<std::__1::shared_ptr<ray::raylet::internal::Work>, std::__1::allocator<std::__1::shared_ptr<ray::raylet::internal::Work> > >, absl::lts_20230802::container_internal::HashEq<int, void>::Hash, absl::lts_20230802::container_internal::HashEq<int, void>::Eq, std::__1::allocator<std::__1::pair<int const, std::__1::deque<std::__1::shared_ptr<ray::raylet::internal::Work>, std::__1::allocator<std::__1::shared_ptr<ray::raylet::internal::Work> > > > > >&, std::__1::function<bool (std::__1::shared_ptr<ray::raylet::internal::Work> const&)>) /proc/self/cwd/bazel-out/k8-opt/bin/src/ray/util/_virtual_includes/container_util/ray/util/container_util.h:180:26 (liblibraylet_Ulib.so+0x214ce7)
[2025-06-26T00:54:16Z]     #4 ray::raylet::LocalTaskManager::CancelTasks(std::__1::function<bool (std::__1::shared_ptr<ray::raylet::internal::Work> const&)>, ray::rpc::RequestWorkerLeaseReply_SchedulingFailureType, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) /proc/self/cwd/src/ray/raylet/local_task_manager.cc:856:3 (liblibraylet_Ulib.so+0x2149dc)
[2025-06-26T00:54:16Z]     #5 ray::raylet::ClusterTaskManager::CancelTasks(std::__1::function<bool (std::__1::shared_ptr<ray::raylet::internal::Work> const&)>, ray::rpc::RequestWorkerLeaseReply_SchedulingFailureType, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) /proc/self/cwd/src/ray/raylet/scheduling/cluster_task_manager.cc:122:27 (libsrc_Sray_Sraylet_Sscheduling_Slibcluster_Utask_Umanager.so+0x74aed)
[2025-06-26T00:54:16Z]     #6 ray::raylet::ClusterTaskManager::CancelAllTasksOwnedBy(ray::WorkerID const&, ray::rpc::RequestWorkerLeaseReply_SchedulingFailureType, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) /proc/self/cwd/src/ray/raylet/scheduling/cluster_task_manager.cc:197:10 (libsrc_Sray_Sraylet_Sscheduling_Slibcluster_Utask_Umanager.so+0x75dfd)
[2025-06-26T00:54:16Z]     #7 ray::raylet::NodeManager::HandleUnexpectedWorkerFailure(ray::WorkerID const&) /proc/self/cwd/src/ray/raylet/node_manager.cc:902:25 (liblibraylet_Ulib.so+0x23d9c5)
[2025-06-26T00:54:16Z]     #8 ray::raylet::NodeManager::RegisterGcs()::$_11::operator()(ray::rpc::WorkerDeltaData const&) const /proc/self/cwd/src/ray/raylet/node_manager.cc:285:9 (liblibraylet_Ulib.so+0x30d72b)
[2025-06-26T00:54:16Z]     #9 decltype(std::__1::forward<ray::raylet::NodeManager::RegisterGcs()::$_11&>(fp)(std::__1::forward<ray::rpc::WorkerDeltaData>(fp0))) std::__1::__invoke<ray::raylet::NodeManager::RegisterGcs()::$_11&, ray::rpc::WorkerDeltaData>(ray::raylet::NodeManager::RegisterGcs()::$_11&, ray::rpc::WorkerDeltaData&&) /opt/llvm/bin/../include/c++/v1/type_traits:3694:1 (liblibraylet_Ulib.so+0x30d72b)
[2025-06-26T00:54:16Z]     #10 void std::__1::__invoke_void_return_wrapper<void, true>::__call<ray::raylet::NodeManager::RegisterGcs()::$_11&, ray::rpc::WorkerDeltaData>(ray::raylet::NodeManager::RegisterGcs()::$_11&, ray::rpc::WorkerDeltaData&&) /opt/llvm/bin/../include/c++/v1/__functional_base:348:9 (liblibraylet_Ulib.so+0x30d72b)
[2025-06-26T00:54:16Z]     #11 std::__1::__function::__alloc_func<ray::raylet::NodeManager::RegisterGcs()::$_11, std::__1::allocator<ray::raylet::NodeManager::RegisterGcs()::$_11>, void (ray::rpc::WorkerDeltaData&&)>::operator()(ray::rpc::WorkerDeltaData&&) /opt/llvm/bin/../include/c++/v1/functional:1558:16 (liblibraylet_Ulib.so+0x30d72b)
[2025-06-26T00:54:16Z]     #12 std::__1::__function::__func<ray::raylet::NodeManager::RegisterGcs()::$_11, std::__1::allocator<ray::raylet::NodeManager::RegisterGcs()::$_11>, void (ray::rpc::WorkerDeltaData&&)>::operator()(ray::rpc::WorkerDeltaData&&) /opt/llvm/bin/../include/c++/v1/functional:1732:12 (liblibraylet_Ulib.so+0x30d72b)
[2025-06-26T00:54:16Z]     #13 std::__1::__function::__value_func<void (ray::rpc::WorkerDeltaData&&)>::operator()(ray::rpc::WorkerDeltaData&&) const /opt/llvm/bin/../include/c++/v1/functional:1885:16 (node_manager_test+0x461e33)
[2025-06-26T00:54:16Z]     #14 std::__1::function<void (ray::rpc::WorkerDeltaData&&)>::operator()(ray::rpc::WorkerDeltaData&&) const /opt/llvm/bin/../include/c++/v1/functional:2560:12 (node_manager_test+0x461e33)
[2025-06-26T00:54:16Z]     #15 ray::raylet::NodeManagerTest_TestDetachedWorkerIsKilledByFailedWorker_Test::TestBody()::$_8::operator()() const /proc/self/cwd/src/ray/raylet/test/node_manager_test.cc:513:5 (node_manager_test+0x461e33)
[2025-06-26T00:54:16Z]     #16 decltype(std::__1::forward<ray::raylet::NodeManagerTest_TestDetachedWorkerIsKilledByFailedWorker_Test::TestBody()::$_8>(fp)()) std::__1::__invoke<ray::raylet::NodeManagerTest_TestDetachedWorkerIsKilledByFailedWorker_Test::TestBody()::$_8>(ray::raylet::NodeManagerTest_TestDetachedWorkerIsKilledByFailedWorker_Test::TestBody()::$_8&&) /opt/llvm/bin/../include/c++/v1/type_traits:3694:1 (node_manager_test+0x461e33)
[2025-06-26T00:54:16Z]     #17 void std::__1::__thread_execute<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, ray::raylet::NodeManagerTest_TestDetachedWorkerIsKilledByFailedWorker_Test::TestBody()::$_8>(std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, ray::raylet::NodeManagerTest_TestDetachedWorkerIsKilledByFailedWorker_Test::TestBody()::$_8>&, std::__1::__tuple_indices<>) /opt/llvm/bin/../include/c++/v1/thread:280:5 (node_manager_test+0x461e33)
[2025-06-26T00:54:16Z]     #18 void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, ray::raylet::NodeManagerTest_TestDetachedWorkerIsKilledByFailedWorker_Test::TestBody()::$_8> >(void*) /opt/llvm/bin/../include/c++/v1/thread:291:5 (node_manager_test+0x461e33)
[2025-06-26T00:54:16Z] 
[2025-06-26T00:54:16Z]   Previous write of size 1 at 0x7b18000035aa by thread T28:
[2025-06-26T00:54:16Z]     #0 absl::lts_20230802::container_internal::SetCtrl(absl::lts_20230802::container_internal::CommonFields const&, unsigned long, absl::lts_20230802::container_internal::ctrl_t, unsigned long) /proc/self/cwd/external/com_google_absl/absl/container/internal/raw_hash_set.h:1375:77 (libexternal_Scom_Ugoogle_Uabsl_Sabsl_Scontainer_Slibraw_Uhash_Uset.so+0x171d5)
[2025-06-26T00:54:16Z]     #1 absl::lts_20230802::container_internal::EraseMetaOnly(absl::lts_20230802::container_internal::CommonFields&, absl::lts_20230802::container_internal::ctrl_t*, unsigned long) /proc/self/cwd/external/com_google_absl/absl/container/internal/raw_hash_set.cc:237:3 (libexternal_Scom_Ugoogle_Uabsl_Sabsl_Scontainer_Slibraw_Uhash_Uset.so+0x171d5)
[2025-06-26T00:54:16Z]     #2 absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<int, std::__1::deque<std::__1::shared_ptr<ray::raylet::internal::Work>, std::__1::allocator<std::__1::shared_ptr<ray::raylet::internal::Work> > > >, absl::lts_20230802::hash_internal::Hash<int>, std::__1::equal_to<int>, std::__1::allocator<std::__1::pair<int const, std::__1::deque<std::__1::shared_ptr<ray::raylet::internal::Work>, std::__1::allocator<std::__1::shared_ptr<ray::raylet::internal::Work> > > > > >::erase_meta_only(absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<int, std::__1::deque<std::__1::shared_ptr<ray::raylet::internal::Work>, std::__1::allocator<std::__1::shared_ptr<ray::raylet::internal::Work> > > >, absl::lts_20230802::hash_internal::Hash<int>, std::__1::equal_to<int>, std::__1::allocator<std::__1::pair<int const, std::__1::deque<std::__1::shared_ptr<ray::raylet::internal::Work>, std::__1::allocator<std::__1::shared_ptr<ray::raylet::internal::Work> > > > > >::const_iterator) /proc/self/cwd/external/com_google_absl/absl/container/internal/raw_hash_set.h:2491:5 (liblibraylet_Ulib.so+0x2126f2)
[2025-06-26T00:54:16Z]     #3 absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<int, std::__1::deque<std::__1::shared_ptr<ray::raylet::internal::Work>, std::__1::allocator<std::__1::shared_ptr<ray::raylet::internal::Work> > > >, absl::lts_20230802::hash_internal::Hash<int>, std::__1::equal_to<int>, std::__1::allocator<std::__1::pair<int const, std::__1::deque<std::__1::shared_ptr<ray::raylet::internal::Work>, std::__1::allocator<std::__1::shared_ptr<ray::raylet::internal::Work> > > > > >::erase(absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<int, std::__1::deque<std::__1::shared_ptr<ray::raylet::internal::Work>, std::__1::allocator<std::__1::shared_ptr<ray::raylet::internal::Work> > > >, absl::lts_20230802::hash_internal::Hash<int>, std::__1::equal_to<int>, std::__1::allocator<std::__1::pair<int const, std::__1::deque<std::__1::shared_ptr<ray::raylet::internal::Work>, std::__1::allocator<std::__1::shared_ptr<ray::raylet::internal::Work> > > > > >::iterator) /proc/self/cwd/external/com_google_absl/absl/container/internal/raw_hash_set.h:2184:5 (liblibraylet_Ulib.so+0x2126f2)
[2025-06-26T00:54:16Z]     #4 ray::raylet::LocalTaskManager::PoppedWorkerHandler(std::__1::shared_ptr<ray::raylet::WorkerInterface>, ray::raylet::PopWorkerStatus, ray::TaskID const&, int, std::__1::shared_ptr<ray::raylet::internal::Work> const&, bool, ray::rpc::Address const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&)::$_2::operator()(std::__1::shared_ptr<ray::raylet::internal::Work> const&, int const&) const /proc/self/cwd/src/ray/raylet/local_task_manager.cc:574:26 (liblibraylet_Ulib.so+0x2126f2)
[2025-06-26T00:54:16Z]     #5 ray::raylet::LocalTaskManager::PoppedWorkerHandler(std::__1::shared_ptr<ray::raylet::WorkerInterface>, ray::raylet::PopWorkerStatus, ray::TaskID const&, int, std::__1::shared_ptr<ray::raylet::internal::Work> const&, bool, ray::rpc::Address const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) /proc/self/cwd/src/ray/raylet/local_task_manager.cc:647:5 (liblibraylet_Ulib.so+0x2119c1)
[2025-06-26T00:54:16Z]     #6 ray::raylet::LocalTaskManager::DispatchScheduledTasksToWorkers()::$_1::operator()(std::__1::shared_ptr<ray::raylet::WorkerInterface>, ray::raylet::PopWorkerStatus, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) const /proc/self/cwd/src/ray/raylet/local_task_manager.cc:382:22 (liblibraylet_Ulib.so+0x22246b)
[2025-06-26T00:54:16Z]     #7 decltype(std::__1::forward<ray::raylet::LocalTaskManager::DispatchScheduledTasksToWorkers()::$_1&>(fp)(std::__1::forward<std::__1::shared_ptr<ray::raylet::WorkerInterface> const&>(fp0), std::__1::forward<ray::raylet::PopWorkerStatus>(fp0), std::__1::forward<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&>(fp0))) std::__1::__invoke<ray::raylet::LocalTaskManager::DispatchScheduledTasksToWorkers()::$_1&, std::__1::shared_ptr<ray::raylet::WorkerInterface> const&, ray::raylet::PopWorkerStatus, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&>(ray::raylet::LocalTaskManager::DispatchScheduledTasksToWorkers()::$_1&, std::__1::shared_ptr<ray::raylet::WorkerInterface> const&, ray::raylet::PopWorkerStatus&&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) /opt/llvm/bin/../include/c++/v1/type_traits:3694:1 (liblibraylet_Ulib.so+0x22246b)
[2025-06-26T00:54:16Z]     #8 bool std::__1::__invoke_void_return_wrapper<bool, false>::__call<ray::raylet::LocalTaskManager::DispatchScheduledTasksToWorkers()::$_1&, std::__1::shared_ptr<ray::raylet::WorkerInterface> const&, ray::raylet::PopWorkerStatus, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&>(ray::raylet::LocalTaskManager::DispatchScheduledTasksToWorkers()::$_1&, std::__1::shared_ptr<ray::raylet::WorkerInterface> const&, ray::raylet::PopWorkerStatus&&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) /opt/llvm/bin/../include/c++/v1/__functional_base:317:16 (liblibraylet_Ulib.so+0x22246b)
[2025-06-26T00:54:16Z]     #9 std::__1::__function::__alloc_func<ray::raylet::LocalTaskManager::DispatchScheduledTasksToWorkers()::$_1, std::__1::allocator<ray::raylet::LocalTaskManager::DispatchScheduledTasksToWorkers()::$_1>, bool (std::__1::shared_ptr<ray::raylet::WorkerInterface> const&, ray::raylet::PopWorkerStatus, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&)>::operator()(std::__1::shared_ptr<ray::raylet::WorkerInterface> const&, ray::raylet::PopWorkerStatus&&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) /opt/llvm/bin/../include/c++/v1/functional:1558:16 (liblibraylet_Ulib.so+0x22246b)
[2025-06-26T00:54:16Z]     #10 std::__1::__function::__func<ray::raylet::LocalTaskManager::DispatchScheduledTasksToWorkers()::$_1, std::__1::allocator<ray::raylet::LocalTaskManager::DispatchScheduledTasksToWorkers()::$_1>, bool (std::__1::shared_ptr<ray::raylet::WorkerInterface> const&, ray::raylet::PopWorkerStatus, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&)>::operator()(std::__1::shared_ptr<ray::raylet::WorkerInterface> const&, ray::raylet::PopWorkerStatus&&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) /opt/llvm/bin/../include/c++/v1/functional:1732:12 (liblibraylet_Ulib.so+0x22246b)
[2025-06-26T00:54:16Z]     #11 std::__1::__function::__value_func<bool (std::__1::shared_ptr<ray::raylet::WorkerInterface> const&, ray::raylet::PopWorkerStatus, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&)>::operator()(std::__1::shared_ptr<ray::raylet::WorkerInterface> const&, ray::raylet::PopWorkerStatus&&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) const /opt/llvm/bin/../include/c++/v1/functional:1885:16 (node_manager_test+0x4643aa)
[2025-06-26T00:54:16Z]     #12 std::__1::function<bool (std::__1::shared_ptr<ray::raylet::WorkerInterface> const&, ray::raylet::PopWorkerStatus, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&)>::operator()(std::__1::shared_ptr<ray::raylet::WorkerInterface> const&, ray::raylet::PopWorkerStatus, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) const /opt/llvm/bin/../include/c++/v1/functional:2560:12 (node_manager_test+0x4643aa)
[2025-06-26T00:54:16Z]     #13 ray::raylet::NodeManagerTest_TestDetachedWorkerIsKilledByFailedWorker_Test::TestBody()::$_9::operator()() const /proc/self/cwd/src/ray/raylet/test/node_manager_test.cc:525:26 (node_manager_test+0x4643aa)
[2025-06-26T00:54:16Z]     #14 decltype(std::__1::forward<ray::raylet::NodeManagerTest_TestDetachedWorkerIsKilledByFailedWorker_Test::TestBody()::$_9&>(fp)()) std::__1::__invoke<ray::raylet::NodeManagerTest_TestDetachedWorkerIsKilledByFailedWorker_Test::TestBody()::$_9&>(ray::raylet::NodeManagerTest_TestDetachedWorkerIsKilledByFailedWorker_Test::TestBody()::$_9&) /opt/llvm/bin/../include/c++/v1/type_traits:3694:1 (node_manager_test+0x4643aa)
[2025-06-26T00:54:16Z]     #15 void std::__1::__invoke_void_return_wrapper<void, true>::__call<ray::raylet::NodeManagerTest_TestDetachedWorkerIsKilledByFailedWorker_Test::TestBody()::$_9&>(ray::raylet::NodeManagerTest_TestDetachedWorkerIsKilledByFailedWorker_Test::TestBody()::$_9&) /opt/llvm/bin/../include/c++/v1/__functional_base:348:9 (node_manager_test+0x4643aa)
[2025-06-26T00:54:16Z]     #16 std::__1::__function::__alloc_func<ray::raylet::NodeManagerTest_TestDetachedWorkerIsKilledByFailedWorker_Test::TestBody()::$_9, std::__1::allocator<ray::raylet::NodeManagerTest_TestDetachedWorkerIsKilledByFailedWorker_Test::TestBody()::$_9>, void ()>::operator()() /opt/llvm/bin/../include/c++/v1/functional:1558:16 (node_manager_test+0x4643aa)
[2025-06-26T00:54:16Z]     #17 std::__1::__function::__func<ray::raylet::NodeManagerTest_TestDetachedWorkerIsKilledByFailedWorker_Test::TestBody()::$_9, std::__1::allocator<ray::raylet::NodeManagerTest_TestDetachedWorkerIsKilledByFailedWorker_Test::TestBody()::$_9>, void ()>::operator()() /opt/llvm/bin/../include/c++/v1/functional:1732:12 (node_manager_test+0x4643aa)
[2025-06-26T00:54:16Z]     #18 std::__1::__function::__value_func<void ()>::operator()() const /opt/llvm/bin/../include/c++/v1/functional:1885:16 (libsrc_Sray_Scommon_Slibevent_Ustats.so+0x6a751)
[2025-06-26T00:54:16Z]     #19 std::__1::function<void ()>::operator()() const /opt/llvm/bin/../include/c++/v1/functional:2560:12 (libsrc_Sray_Scommon_Slibevent_Ustats.so+0x6a751)
[2025-06-26T00:54:16Z]     #20 EventTracker::RecordExecution(std::__1::function<void ()> const&, std::__1::shared_ptr<StatsHandle>) /proc/self/cwd/src/ray/common/event_stats.cc:113:3 (libsrc_Sray_Scommon_Slibevent_Ustats.so+0x6a751)
[2025-06-26T00:54:16Z]     #21 instrumented_io_context::post(std::__1::function<void ()>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, long)::$_0::operator()() /proc/self/cwd/src/ray/common/asio/instrumented_io_context.cc:99:7 (libsrc_Sray_Scommon_Slibasio.so+0x9627c)
[2025-06-26T00:54:16Z]     #22 decltype(std::__1::forward<instrumented_io_context::post(std::__1::function<void ()>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, long)::$_0&>(fp)()) std::__1::__invoke<instrumented_io_context::post(std::__1::function<void ()>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, long)::$_0&>(instrumented_io_context::post(std::__1::function<void ()>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, long)::$_0&) /opt/llvm/bin/../include/c++/v1/type_traits:3694:1 (libsrc_Sray_Scommon_Slibasio.so+0x9627c)
[2025-06-26T00:54:16Z]     #23 void std::__1::__invoke_void_return_wrapper<void, true>::__call<instrumented_io_context::post(std::__1::function<void ()>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, long)::$_0&>(instrumented_io_context::post(std::__1::function<void ()>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, long)::$_0&) /opt/llvm/bin/../include/c++/v1/__functional_base:348:9 (libsrc_Sray_Scommon_Slibasio.so+0x9627c)
[2025-06-26T00:54:16Z]     #24 std::__1::__function::__alloc_func<instrumented_io_context::post(std::__1::function<void ()>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, long)::$_0, std::__1::allocator<instrumented_io_context::post(std::__1::function<void ()>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, long)::$_0>, void ()>::operator()() /opt/llvm/bin/../include/c++/v1/functional:1558:16 (libsrc_Sray_Scommon_Slibasio.so+0x9627c)
[2025-06-26T00:54:16Z]     #25 std::__1::__function::__func<instrumented_io_context::post(std::__1::function<void ()>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, long)::$_0, std::__1::allocator<instrumented_io_context::post(std::__1::function<void ()>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, long)::$_0>, void ()>::operator()() /opt/llvm/bin/../include/c++/v1/functional:1732:12 (libsrc_Sray_Scommon_Slibasio.so+0x9627c)
[2025-06-26T00:54:16Z]     #26 std::__1::__function::__value_func<void ()>::operator()() const /opt/llvm/bin/../include/c++/v1/functional:1885:16 (libsrc_Sray_Scommon_Slibasio.so+0x9506e)
[2025-06-26T00:54:16Z]     #27 std::__1::function<void ()>::operator()() const /opt/llvm/bin/../include/c++/v1/functional:2560:12 (libsrc_Sray_Scommon_Slibasio.so+0x9506e)
[2025-06-26T00:54:16Z]     #28 boost::asio::detail::binder0<std::__1::function<void ()> >::operator()() /proc/self/cwd/external/boost/boost/asio/detail/bind_handler.hpp:60:5 (libsrc_Sray_Scommon_Slibasio.so+0x9506e)
[2025-06-26T00:54:16Z]     #29 void boost::asio::asio_handler_invoke<boost::asio::detail::binder0<std::__1::function<void ()> > >(boost::asio::detail::binder0<std::__1::function<void ()> >&, ...) /proc/self/cwd/external/boost/boost/asio/handler_invoke_hook.hpp:88:3 (libsrc_Sray_Scommon_Slibasio.so+0x9506e)
[2025-06-26T00:54:16Z]     #30 void boost_asio_handler_invoke_helpers::invoke<boost::asio::detail::binder0<std::__1::function<void ()> >, std::__1::function<void ()> >(boost::asio::detail::binder0<std::__1::function<void ()> >&, std::__1::function<void ()>&) /proc/self/cwd/external/boost/boost/asio/detail/handler_invoke_helpers.hpp:54:3 (libsrc_Sray_Scommon_Slibasio.so+0x9506e)
[2025-06-26T00:54:16Z]     #31 void boost::asio::detail::asio_handler_invoke<boost::asio::detail::binder0<std::__1::function<void ()> >, std::__1::function<void ()> >(boost::asio::detail::binder0<std::__1::function<void ()> >&, boost::asio::detail::binder0<std::__1::function<void ()> >*) /proc/self/cwd/external/boost/boost/asio/detail/bind_handler.hpp:111:3 (libsrc_Sray_Scommon_Slibasio.so+0x9506e)
[2025-06-26T00:54:16Z]     #32 void boost_asio_handler_invoke_helpers::invoke<boost::asio::detail::binder0<std::__1::function<void ()> >, boost::asio::detail::binder0<std::__1::function<void ()> > >(boost::asio::detail::binder0<std::__1::function<void ()> >&, boost::asio::detail::binder0<std::__1::function<void ()> >&) /proc/self/cwd/external/boost/boost/asio/detail/handler_invoke_helpers.hpp:54:3 (libsrc_Sray_Scommon_Slibasio.so+0x9506e)
[2025-06-26T00:54:16Z]     #33 boost::asio::detail::executor_op<boost::asio::detail::binder0<std::__1::function<void ()> >, std::__1::allocator<void>, boost::asio::detail::scheduler_operation>::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long) /proc/self/cwd/external/boost/boost/asio/detail/executor_op.hpp:70:7 (libsrc_Sray_Scommon_Slibasio.so+0x9506e)
[2025-06-26T00:54:16Z]     #34 boost::asio::detail::scheduler_operation::complete(void*, boost::system::error_code const&, unsigned long) /proc/self/cwd/external/boost/boost/asio/detail/scheduler_operation.hpp:40:5 (libexternal_Sboost_Slibasio.so+0xa9840)
[2025-06-26T00:54:16Z]     #35 boost::asio::detail::scheduler::do_run_one(boost::asio::detail::conditionally_enabled_mutex::scoped_lock&, boost::asio::detail::scheduler_thread_info&, boost::system::error_code const&) /proc/self/cwd/external/boost/boost/asio/detail/impl/scheduler.ipp:492:12 (libexternal_Sboost_Slibasio.so+0xa9840)
[2025-06-26T00:54:16Z]     #36 boost::asio::detail::scheduler::run(boost::system::error_code&) /proc/self/cwd/external/boost/boost/asio/detail/impl/scheduler.ipp:210:10 (libexternal_Sboost_Slibasio.so+0x96501)
[2025-06-26T00:54:16Z]     #37 boost::asio::io_context::run() /proc/self/cwd/external/boost/boost/asio/impl/io_context.ipp:63:24 (libexternal_Sboost_Slibasio.so+0x9638e)
[2025-06-26T00:54:16Z]     #38 ray::raylet::NodeManagerTest_TestDetachedWorkerIsKilledByFailedWorker_Test::TestBody()::$_7::operator()() const /proc/self/cwd/src/ray/raylet/test/node_manager_test.cc:480:17 (node_manager_test+0x461784)
[2025-06-26T00:54:16Z]     #39 decltype(std::__1::forward<ray::raylet::NodeManagerTest_TestDetachedWorkerIsKilledByFailedWorker_Test::TestBody()::$_7>(fp)()) std::__1::__invoke<ray::raylet::NodeManagerTest_TestDetachedWorkerIsKilledByFailedWorker_Test::TestBody()::$_7>(ray::raylet::NodeManagerTest_TestDetachedWorkerIsKilledByFailedWorker_Test::TestBody()::$_7&&) /opt/llvm/bin/../include/c++/v1/type_traits:3694:1 (node_manager_test+0x461784)
[2025-06-26T00:54:16Z]     #40 void std::__1::__thread_execute<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, ray::raylet::NodeManagerTest_TestDetachedWorkerIsKilledByFailedWorker_Test::TestBody()::$_7>(std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, ray::raylet::NodeManagerTest_TestDetachedWorkerIsKilledByFailedWorker_Test::TestBody()::$_7>&, std::__1::__tuple_indices<>) /opt/llvm/bin/../include/c++/v1/thread:280:5 (node_manager_test+0x461784)
[2025-06-26T00:54:16Z]     #41 void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, ray::raylet::NodeManagerTest_TestDetachedWorkerIsKilledByFailedWorker_Test::TestBody()::$_7> >(void*) /opt/llvm/bin/../include/c++/v1/thread:291:5 (node_manager_test+0x461784)
[2025-06-26T00:54:16Z] 
[2025-06-26T00:54:16Z]   Location is heap block of size 88 at 0x7b18000035a0 allocated by thread T28:
[2025-06-26T00:54:16Z]     #0 malloc /tmp/llvm/utils/release/final/llvm-project/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:651:5 (node_manager_test+0x15421b)
[2025-06-26T00:54:16Z]     #1 operator new(unsigned long) <null> (liblibraylet_Ulib.so+0x3c9cd4)
[2025-06-26T00:54:16Z]     #2 absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<int, std::__1::deque<std::__1::shared_ptr<ray::raylet::internal::Work>, std::__1::allocator<std::__1::shared_ptr<ray::raylet::internal::Work> > > >, absl::lts_20230802::hash_internal::Hash<int>, std::__1::equal_to<int>, std::__1::allocator<std::__1::pair<int const, std::__1::deque<std::__1::shared_ptr<ray::raylet::internal::Work>, std::__1::allocator<std::__1::shared_ptr<ray::raylet::internal::Work> > > > > >::initialize_slots() /proc/self/cwd/external/com_google_absl/absl/container/internal/raw_hash_set.h:2505:5 (liblibraylet_Ulib.so+0x21d43a)
[2025-06-26T00:54:16Z]     #3 absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<int, std::__1::deque<std::__1::shared_ptr<ray::raylet::internal::Work>, std::__1::allocator<std::__1::shared_ptr<ray::raylet::internal::Work> > > >, absl::lts_20230802::hash_internal::Hash<int>, std::__1::equal_to<int>, std::__1::allocator<std::__1::pair<int const, std::__1::deque<std::__1::shared_ptr<ray::raylet::internal::Work>, std::__1::allocator<std::__1::shared_ptr<ray::raylet::internal::Work> > > > > >::resize(unsigned long) /proc/self/cwd/external/com_google_absl/absl/container/internal/raw_hash_set.h:2515:5 (liblibraylet_Ulib.so+0x21d43a)
[2025-06-26T00:54:16Z]     #4 absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<int, std::__1::deque<std::__1::shared_ptr<ray::raylet::internal::Work>, std::__1::allocator<std::__1::shared_ptr<ray::raylet::internal::Work> > > >, absl::lts_20230802::hash_internal::Hash<int>, std::__1::equal_to<int>, std::__1::allocator<std::__1::pair<int const, std::__1::deque<std::__1::shared_ptr<ray::raylet::internal::Work>, std::__1::allocator<std::__1::shared_ptr<ray::raylet::internal::Work> > > > > >::rehash_and_grow_if_necessary() /proc/self/cwd/external/com_google_absl/absl/container/internal/raw_hash_set.h:2603:7 (liblibraylet_Ulib.so+0x21d31a)
[2025-06-26T00:54:16Z]     #5 absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<int, std::__1::deque<std::__1::shared_ptr<ray::raylet::internal::Work>, std::__1::allocator<std::__1::shared_ptr<ray::raylet::internal::Work> > > >, absl::lts_20230802::hash_internal::Hash<int>, std::__1::equal_to<int>, std::__1::allocator<std::__1::pair<int const, std::__1::deque<std::__1::shared_ptr<ray::raylet::internal::Work>, std::__1::allocator<std::__1::shared_ptr<ray::raylet::internal::Work> > > > > >::prepare_insert(unsigned long) /proc/self/cwd/external/com_google_absl/absl/container/internal/raw_hash_set.h:2678:7 (liblibraylet_Ulib.so+0x21d31a)
[2025-06-26T00:54:16Z]     #6 std::__1::pair<unsigned long, bool> absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<int, std::__1::deque<std::__1::shared_ptr<ray::raylet::internal::Work>, std::__1::allocator<std::__1::shared_ptr<ray::raylet::internal::Work> > > >, absl::lts_20230802::hash_internal::Hash<int>, std::__1::equal_to<int>, std::__1::allocator<std::__1::pair<int const, std::__1::deque<std::__1::shared_ptr<ray::raylet::internal::Work>, std::__1::allocator<std::__1::shared_ptr<ray::raylet::internal::Work> > > > > >::find_or_prepare_insert<int>(int const&) /proc/self/cwd/external/com_google_absl/absl/container/internal/raw_hash_set.h:2659:13 (liblibraylet_Ulib.so+0x20aeb8)
[2025-06-26T00:54:16Z]     #7 std::__1::pair<absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<int, std::__1::deque<std::__1::shared_ptr<ray::raylet::internal::Work>, std::__1::allocator<std::__1::shared_ptr<ray::raylet::internal::Work> > > >, absl::lts_20230802::hash_internal::Hash<int>, std::__1::equal_to<int>, std::__1::allocator<std::__1::pair<int const, std::__1::deque<std::__1::shared_ptr<ray::raylet::internal::Work>, std::__1::allocator<std::__1::shared_ptr<ray::raylet::internal::Work> > > > > >::iterator, bool> absl::lts_20230802::container_internal::raw_hash_map<absl::lts_20230802::container_internal::FlatHashMapPolicy<int, std::__1::deque<std::__1::shared_ptr<ray::raylet::internal::Work>, std::__1::allocator<std::__1::shared_ptr<ray::raylet::internal::Work> > > >, absl::lts_20230802::hash_internal::Hash<int>, std::__1::equal_to<int>, std::__1::allocator<std::__1::pair<int const, std::__1::deque<std::__1::shared_ptr<ray::raylet::internal::Work>, std::__1::allocator<std::__1::shared_ptr<ray::raylet::internal::Work> > > > > >::try_emplace_impl<int const&>(int const&) /proc/self/cwd/external/com_google_absl/absl/container/internal/raw_hash_map.h:202:22 (liblibraylet_Ulib.so+0x20aeb8)
[2025-06-26T00:54:16Z]     #8 std::__1::pair<absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<int, std::__1::deque<std::__1::shared_ptr<ray::raylet::internal::Work>, std::__1::allocator<std::__1::shared_ptr<ray::raylet::internal::Work> > > >, absl::lts_20230802::hash_internal::Hash<int>, std::__1::equal_to<int>, std::__1::allocator<std::__1::pair<int const, std::__1::deque<std::__1::shared_ptr<ray::raylet::internal::Work>, std::__1::allocator<std::__1::shared_ptr<ray::raylet::internal::Work> > > > > >::iterator, bool> absl::lts_20230802::container_internal::raw_hash_map<absl::lts_20230802::container_internal::FlatHashMapPolicy<int, std::__1::deque<std::__1::shared_ptr<ray::raylet::internal::Work>, std::__1::allocator<std::__1::shared_ptr<ray::raylet::internal::Work> > > >, absl::lts_20230802::hash_internal::Hash<int>, std::__1::equal_to<int>, std::__1::allocator<std::__1::pair<int const, std::__1::deque<std::__1::shared_ptr<ray::raylet::internal::Work>, std::__1::allocator<std::__1::shared_ptr<ray::raylet::internal::Work> > > > > >::try_emplace<int, 0>(int const&) /proc/self/cwd/external/com_google_absl/absl/container/internal/raw_hash_map.h:139:12 (liblibraylet_Ulib.so+0x20aeb8)
[2025-06-26T00:54:16Z]     #9 decltype(absl::lts_20230802::container_internal::FlatHashMapPolicy<int, std::__1::deque<std::__1::shared_ptr<ray::raylet::internal::Work>, std::__1::allocator<std::__1::shared_ptr<ray::raylet::internal::Work> > > >::value(std::__1::pair<int const, std::__1::deque<std::__1::shared_ptr<ray::raylet::internal::Work>, std::__1::allocator<std::__1::shared_ptr<ray::raylet::internal::Work> > > >* std::__1::addressof<std::__1::pair<int const, std::__1::deque<std::__1::shared_ptr<ray::raylet::internal::Work>, std::__1::allocator<std::__1::shared_ptr<ray::raylet::internal::Work> > > > >(std::__1::pair<int const, std::__1::deque<std::__1::shared_ptr<ray::raylet::internal::Work>, std::__1::allocator<std::__1::shared_ptr<ray::raylet::internal::Work> > > >&)(decltype(std::__1::__declval<std::__1::pair<int const, std::__1::deque<std::__1::shared_ptr<ray::raylet::internal::Work>, std::__1::allocator<std::__1::shared_ptr<ray::raylet::internal::Work> > > > >(0)) std::__1::declval<std::__1::pair<int const, std::__1::deque<std::__1::shared_ptr<ray::raylet::internal::Work>, std::__1::allocator<std::__1::shared_ptr<ray::raylet::internal::Work> > > >&>()()))) absl::lts_20230802::container_internal::raw_hash_map<absl::lts_20230802::container_internal::FlatHashMapPolicy<int, std::__1::deque<std::__1::shared_ptr<ray::raylet::internal::Work>, std::__1::allocator<std::__1::shared_ptr<ray::raylet::internal::Work> > > >, absl::lts_20230802::hash_internal::Hash<int>, std::__1::equal_to<int>, std::__1::allocator<std::__1::pair<int const, std::__1::deque<std::__1::shared_ptr<ray::raylet::internal::Work>, std::__1::allocator<std::__1::shared_ptr<ray::raylet::internal::Work> > > > > >::operator[]<int, absl::lts_20230802::container_internal::FlatHashMapPolicy<int, std::__1::deque<std::__1::shared_ptr<ray::raylet::internal::Work>, std::__1::allocator<std::__1::shared_ptr<ray::raylet::internal::Work> > > > >(int const&) /proc/self/cwd/external/com_google_absl/absl/container/internal/raw_hash_map.h:184:28 (liblibraylet_Ulib.so+0x20aeb8)
[2025-06-26T00:54:16Z]     #10 ray::raylet::LocalTaskManager::WaitForTaskArgsRequests(std::__1::shared_ptr<ray::raylet::internal::Work>) /proc/self/cwd/src/ray/raylet/local_task_manager.cc:105:5 (liblibraylet_Ulib.so+0x20aeb8)
[2025-06-26T00:54:16Z]     #11 ray::raylet::LocalTaskManager::QueueAndScheduleTask(std::__1::shared_ptr<ray::raylet::internal::Work>) /proc/self/cwd/src/ray/raylet/local_task_manager.cc:79:3 (liblibraylet_Ulib.so+0x20a08b)
[2025-06-26T00:54:16Z]     #12 ray::raylet::ClusterTaskManager::ScheduleOnNode(ray::NodeID const&, std::__1::shared_ptr<ray::raylet::internal::Work> const&) /proc/self/cwd/src/ray/raylet/scheduling/cluster_task_manager.cc:425:25 (libsrc_Sray_Sraylet_Sscheduling_Slibcluster_Utask_Umanager.so+0x7884a)
[2025-06-26T00:54:16Z]     #13 ray::raylet::ClusterTaskManager::ScheduleAndDispatchTasks() /proc/self/cwd/src/ray/raylet/scheduling/cluster_task_manager.cc:262:7 (libsrc_Sray_Sraylet_Sscheduling_Slibcluster_Utask_Umanager.so+0x7638b)
[2025-06-26T00:54:16Z]     #14 ray::raylet::ClusterTaskManager::QueueAndScheduleTask(ray::RayTask, bool, bool, ray::rpc::RequestWorkerLeaseReply*, std::__1::function<void (ray::Status, std::__1::function<void ()>, std::__1::function<void ()>)>) /proc/self/cwd/src/ray/raylet/scheduling/cluster_task_manager.cc:72:3 (libsrc_Sray_Sraylet_Sscheduling_Slibcluster_Utask_Umanager.so+0x746e9)
[2025-06-26T00:54:16Z]     #15 ray::raylet::NodeManager::HandleRequestWorkerLease(ray::rpc::RequestWorkerLeaseRequest, ray::rpc::RequestWorkerLeaseReply*, std::__1::function<void (ray::Status, std::__1::function<void ()>, std::__1::function<void ()>)>) /proc/self/cwd/src/ray/raylet/node_manager.cc:1842:25 (liblibraylet_Ulib.so+0x24b9c0)
[2025-06-26T00:54:16Z]     #16 ray::rpc::ServerCallImpl<ray::rpc::NodeManagerServiceHandler, ray::rpc::RequestWorkerLeaseRequest, ray::rpc::RequestWorkerLeaseReply, (ray::rpc::AuthType)0>::HandleRequestImpl(bool) /proc/self/cwd/bazel-out/k8-opt/bin/_virtual_includes/rpc_server_call/ray/rpc/server_call.h:279:7 (liblibraylet_Ulib.so+0x28e727)
[2025-06-26T00:54:16Z]     #17 ray::rpc::ServerCallImpl<ray::rpc::NodeManagerServiceHandler, ray::rpc::RequestWorkerLeaseRequest, ray::rpc::RequestWorkerLeaseReply, (ray::rpc::AuthType)0>::HandleRequest()::'lambda'()::operator()() const /proc/self/cwd/bazel-out/k8-opt/bin/_virtual_includes/rpc_server_call/ray/rpc/server_call.h:240:47 (liblibraylet_Ulib.so+0x28e4b4)
[2025-06-26T00:54:16Z]     #18 decltype(std::__1::forward<ray::rpc::NodeManagerServiceHandler>(fp)(std::__1::forward<ray::rpc::RequestWorkerLeaseRequest>(fp0)...)) std::__1::__invoke<ray::rpc::ServerCallImpl<ray::rpc::NodeManagerServiceHandler, ray::rpc::RequestWorkerLeaseRequest, ray::rpc::RequestWorkerLeaseReply, (ray::rpc::AuthType)0>::HandleRequest()::'lambda'()&>(ray::rpc::NodeManagerServiceHandler&&, ray::rpc::RequestWorkerLeaseRequest&&...) /opt/llvm/bin/../include/c++/v1/type_traits:3694:1 (liblibraylet_Ulib.so+0x28e4b4)
[2025-06-26T00:54:16Z]     #19 void std::__1::__invoke_void_return_wrapper<void, true>::__call<ray::rpc::ServerCallImpl<ray::rpc::NodeManagerServiceHandler, ray::rpc::RequestWorkerLeaseRequest, ray::rpc::RequestWorkerLeaseReply, (ray::rpc::AuthType)0>::HandleRequest()::'lambda'()&>(ray::rpc::NodeManagerServiceHandler&&...) /opt/llvm/bin/../include/c++/v1/__functional_base:348:9 (liblibraylet_Ulib.so+0x28e4b4)
[2025-06-26T00:54:16Z]     #20 std::__1::__function::__alloc_func<ray::rpc::ServerCallImpl<ray::rpc::NodeManagerServiceHandler, ray::rpc::RequestWorkerLeaseRequest, ray::rpc::RequestWorkerLeaseReply, (ray::rpc::AuthType)0>::HandleRequest()::'lambda'(), std::__1::allocator<ray::rpc::ServerCallImpl<ray::rpc::NodeManagerServiceHandler, ray::rpc::RequestWorkerLeaseRequest, ray::rpc::RequestWorkerLeaseReply, (ray::rpc::AuthType)0>::HandleRequest()::'lambda'()>, void ()>::operator()() /opt/llvm/bin/../include/c++/v1/functional:1558:16 (liblibraylet_Ulib.so+0x28e4b4)
[2025-06-26T00:54:16Z]     #21 std::__1::__function::__func<ray::rpc::ServerCallImpl<ray::rpc::NodeManagerServiceHandler, ray::rpc::RequestWorkerLeaseRequest, ray::rpc::RequestWorkerLeaseReply, (ray::rpc::AuthType)0>::HandleRequest()::'lambda'(), std::__1::allocator<ray::rpc::ServerCallImpl<ray::rpc::NodeManagerServiceHandler, ray::rpc::RequestWorkerLeaseRequest, ray::rpc::RequestWorkerLeaseReply, (ray::rpc::AuthType)0>::HandleRequest()::'lambda'()>, void ()>::operator()() /opt/llvm/bin/../include/c++/v1/functional:1732:12 (liblibraylet_Ulib.so+0x28e4b4)
[2025-06-26T00:54:16Z]     #22 std::__1::__function::__value_func<void ()>::operator()() const /opt/llvm/bin/../include/c++/v1/functional:1885:16 (libsrc_Sray_Scommon_Slibevent_Ustats.so+0x6a751)
[2025-06-26T00:54:16Z]     #23 std::__1::function<void ()>::operator()() const /opt/llvm/bin/../include/c++/v1/functional:2560:12 (libsrc_Sray_Scommon_Slibevent_Ustats.so+0x6a751)
[2025-06-26T00:54:16Z]     #24 EventTracker::RecordExecution(std::__1::function<void ()> const&, std::__1::shared_ptr<StatsHandle>) /proc/self/cwd/src/ray/common/event_stats.cc:113:3 (libsrc_Sray_Scommon_Slibevent_Ustats.so+0x6a751)
[2025-06-26T00:54:16Z]     #25 instrumented_io_context::post(std::__1::function<void ()>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, long)::$_0::operator()() /proc/self/cwd/src/ray/common/asio/instrumented_io_context.cc:99:7 (libsrc_Sray_Scommon_Slibasio.so+0x9627c)
[2025-06-26T00:54:16Z]     #26 decltype(std::__1::forward<instrumented_io_context::post(std::__1::function<void ()>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, long)::$_0&>(fp)()) std::__1::__invoke<instrumented_io_context::post(std::__1::function<void ()>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, long)::$_0&>(instrumented_io_context::post(std::__1::function<void ()>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, long)::$_0&) /opt/llvm/bin/../include/c++/v1/type_traits:3694:1 (libsrc_Sray_Scommon_Slibasio.so+0x9627c)
[2025-06-26T00:54:16Z]     #27 void std::__1::__invoke_void_return_wrapper<void, true>::__call<instrumented_io_context::post(std::__1::function<void ()>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, long)::$_0&>(instrumented_io_context::post(std::__1::function<void ()>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, long)::$_0&) /opt/llvm/bin/../include/c++/v1/__functional_base:348:9 (libsrc_Sray_Scommon_Slibasio.so+0x9627c)
[2025-06-26T00:54:16Z]     #28 std::__1::__function::__alloc_func<instrumented_io_context::post(std::__1::function<void ()>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, long)::$_0, std::__1::allocator<instrumented_io_context::post(std::__1::function<void ()>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, long)::$_0>, void ()>::operator()() /opt/llvm/bin/../include/c++/v1/functional:1558:16 (libsrc_Sray_Scommon_Slibasio.so+0x9627c)
[2025-06-26T00:54:16Z]     #29 std::__1::__function::__func<instrumented_io_context::post(std::__1::function<void ()>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, long)::$_0, std::__1::allocator<instrumented_io_context::post(std::__1::function<void ()>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, long)::$_0>, void ()>::operator()() /opt/llvm/bin/../include/c++/v1/functional:1732:12 (libsrc_Sray_Scommon_Slibasio.so+0x9627c)
[2025-06-26T00:54:16Z]     #30 std::__1::__function::__value_func<void ()>::operator()() const /opt/llvm/bin/../include/c++/v1/functional:1885:16 (libsrc_Sray_Scommon_Slibasio.so+0x9506e)
[2025-06-26T00:54:16Z]     #31 std::__1::function<void ()>::operator()() const /opt/llvm/bin/../include/c++/v1/functional:2560:12 (libsrc_Sray_Scommon_Slibasio.so+0x9506e)
[2025-06-26T00:54:16Z]     #32 boost::asio::detail::binder0<std::__1::function<void ()> >::operator()() /proc/self/cwd/external/boost/boost/asio/detail/bind_handler.hpp:60:5 (libsrc_Sray_Scommon_Slibasio.so+0x9506e)
[2025-06-26T00:54:16Z]     #33 void boost::asio::asio_handler_invoke<boost::asio::detail::binder0<std::__1::function<void ()> > >(boost::asio::detail::binder0<std::__1::function<void ()> >&, ...) /proc/self/cwd/external/boost/boost/asio/handler_invoke_hook.hpp:88:3 (libsrc_Sray_Scommon_Slibasio.so+0x9506e)
[2025-06-26T00:54:16Z]     #34 void boost_asio_handler_invoke_helpers::invoke<boost::asio::detail::binder0<std::__1::function<void ()> >, std::__1::function<void ()> >(boost::asio::detail::binder0<std::__1::function<void ()> >&, std::__1::function<void ()>&) /proc/self/cwd/external/boost/boost/asio/detail/handler_invoke_helpers.hpp:54:3 (libsrc_Sray_Scommon_Slibasio.so+0x9506e)
[2025-06-26T00:54:16Z]     #35 void boost::asio::detail::asio_handler_invoke<boost::asio::detail::binder0<std::__1::function<void ()> >, std::__1::function<void ()> >(boost::asio::detail::binder0<std::__1::function<void ()> >&, boost::asio::detail::binder0<std::__1::function<void ()> >*) /proc/self/cwd/external/boost/boost/asio/detail/bind_handler.hpp:111:3 (libsrc_Sray_Scommon_Slibasio.so+0x9506e)
[2025-06-26T00:54:16Z]     #36 void boost_asio_handler_invoke_helpers::invoke<boost::asio::detail::binder0<std::__1::function<void ()> >, boost::asio::detail::binder0<std::__1::function<void ()> > >(boost::asio::detail::binder0<std::__1::function<void ()> >&, boost::asio::detail::binder0<std::__1::function<void ()> >&) /proc/self/cwd/external/boost/boost/asio/detail/handler_invoke_helpers.hpp:54:3 (libsrc_Sray_Scommon_Slibasio.so+0x9506e)
[2025-06-26T00:54:16Z]     #37 boost::asio::detail::executor_op<boost::asio::detail::binder0<std::__1::function<void ()> >, std::__1::allocator<void>, boost::asio::detail::scheduler_operation>::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long) /proc/self/cwd/external/boost/boost/asio/detail/executor_op.hpp:70:7 (libsrc_Sray_Scommon_Slibasio.so+0x9506e)
[2025-06-26T00:54:16Z]     #38 boost::asio::detail::scheduler_operation::complete(void*, boost::system::error_code const&, unsigned long) /proc/self/cwd/external/boost/boost/asio/detail/scheduler_operation.hpp:40:5 (libexternal_Sboost_Slibasio.so+0xa9840)
[2025-06-26T00:54:16Z]     #39 boost::asio::detail::scheduler::do_run_one(boost::asio::detail::conditionally_enabled_mutex::scoped_lock&, boost::asio::detail::scheduler_thread_info&, boost::system::error_code const&) /proc/self/cwd/external/boost/boost/asio/detail/impl/scheduler.ipp:492:12 (libexternal_Sboost_Slibasio.so+0xa9840)
[2025-06-26T00:54:16Z]     #40 boost::asio::detail::scheduler::run(boost::system::error_code&) /proc/self/cwd/external/boost/boost/asio/detail/impl/scheduler.ipp:210:10 (libexternal_Sboost_Slibasio.so+0x96501)
[2025-06-26T00:54:16Z]     #41 boost::asio::io_context::run() /proc/self/cwd/external/boost/boost/asio/impl/io_context.ipp:63:24 (libexternal_Sboost_Slibasio.so+0x9638e)
[2025-06-26T00:54:16Z]     #42 ray::raylet::NodeManagerTest_TestDetachedWorkerIsKilledByFailedWorker_Test::TestBody()::$_7::operator()() const /proc/self/cwd/src/ray/raylet/test/node_manager_test.cc:480:17 (node_manager_test+0x461784)
[2025-06-26T00:54:16Z]     #43 decltype(std::__1::forward<ray::raylet::NodeManagerTest_TestDetachedWorkerIsKilledByFailedWorker_Test::TestBody()::$_7>(fp)()) std::__1::__invoke<ray::raylet::NodeManagerTest_TestDetachedWorkerIsKilledByFailedWorker_Test::TestBody()::$_7>(ray::raylet::NodeManagerTest_TestDetachedWorkerIsKilledByFailedWorker_Test::TestBody()::$_7&&) /opt/llvm/bin/../include/c++/v1/type_traits:3694:1 (node_manager_test+0x461784)
[2025-06-26T00:54:16Z]     #44 void std::__1::__thread_execute<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, ray::raylet::NodeManagerTest_TestDetachedWorkerIsKilledByFailedWorker_Test::TestBody()::$_7>(std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, ray::raylet::NodeManagerTest_TestDetachedWorkerIsKilledByFailedWorker_Test::TestBody()::$_7>&, std::__1::__tuple_indices<>) /opt/llvm/bin/../include/c++/v1/thread:280:5 (node_manager_test+0x461784)
[2025-06-26T00:54:16Z]     #45 void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, ray::raylet::NodeManagerTest_TestDetachedWorkerIsKilledByFailedWorker_Test::TestBody()::$_7> >(void*) /opt/llvm/bin/../include/c++/v1/thread:291:5 (node_manager_test+0x461784)
[2025-06-26T00:54:16Z] 
[2025-06-26T00:54:16Z]   Thread T29 (tid=1523, running) created by main thread at:
[2025-06-26T00:54:16Z]     #0 pthread_create /tmp/llvm/utils/release/final/llvm-project/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:965:3 (node_manager_test+0x155a0b)
[2025-06-26T00:54:16Z]     #1 std::__1::__libcpp_thread_create(unsigned long*, void* (*)(void*), void*) /opt/llvm/bin/../include/c++/v1/__threading_support:509:10 (node_manager_test+0x1d99e5)
[2025-06-26T00:54:16Z]     #2 std::__1::thread::thread<ray::raylet::NodeManagerTest_TestDetachedWorkerIsKilledByFailedWorker_Test::TestBody()::$_8, void>(ray::raylet::NodeManagerTest_TestDetachedWorkerIsKilledByFailedWorker_Test::TestBody()::$_8&&) /opt/llvm/bin/../include/c++/v1/thread:307:16 (node_manager_test+0x1d99e5)
[2025-06-26T00:54:16Z]     #3 ray::raylet::NodeManagerTest_TestDetachedWorkerIsKilledByFailedWorker_Test::TestBody() /proc/self/cwd/src/ray/raylet/test/node_manager_test.cc:483:15 (node_manager_test+0x1d99e5)
[2025-06-26T00:54:16Z]     #4 void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /proc/self/cwd/external/com_google_googletest/googletest/src/gtest.cc:2612:10 (libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so+0xd4a1f)
[2025-06-26T00:54:16Z]     #5 void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /proc/self/cwd/external/com_google_googletest/googletest/src/gtest.cc:2648:14 (libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so+0xd4a1f)
[2025-06-26T00:54:16Z]     #6 testing::Test::Run() /proc/self/cwd/external/com_google_googletest/googletest/src/gtest.cc:2687:5 (libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so+0xd4901)
[2025-06-26T00:54:16Z]     #7 testing::TestInfo::Run() /proc/self/cwd/external/com_google_googletest/googletest/src/gtest.cc:2836:11 (libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so+0xd63c8)
[2025-06-26T00:54:16Z]     #8 testing::TestSuite::Run() /proc/self/cwd/external/com_google_googletest/googletest/src/gtest.cc:3015:30 (libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so+0xd79e4)
[2025-06-26T00:54:16Z]     #9 testing::internal::UnitTestImpl::RunAllTests() /proc/self/cwd/external/com_google_googletest/googletest/src/gtest.cc:5920:44 (libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so+0xecc44)
[2025-06-26T00:54:16Z]     #10 bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /proc/self/cwd/external/com_google_googletest/googletest/src/gtest.cc:2612:10 (libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so+0xec05f)
[2025-06-26T00:54:16Z]     #11 bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /proc/self/cwd/external/com_google_googletest/googletest/src/gtest.cc:2648:14 (libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so+0xec05f)
[2025-06-26T00:54:16Z]     #12 testing::UnitTest::Run() /proc/self/cwd/external/com_google_googletest/googletest/src/gtest.cc:5484:10 (libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so+0xebe4c)
[2025-06-26T00:54:16Z]     #13 RUN_ALL_TESTS() /proc/self/cwd/external/com_google_googletest/googletest/include/gtest/gtest.h:2317:73 (node_manager_test+0x1e0b6e)
[2025-06-26T00:54:16Z]     #14 main /proc/self/cwd/src/ray/raylet/test/node_manager_test.cc:653:10 (node_manager_test+0x1e0b6e)
[2025-06-26T00:54:16Z] 
[2025-06-26T00:54:16Z]   Thread T28 (tid=1522, running) created by main thread at:
[2025-06-26T00:54:16Z]     #0 pthread_create /tmp/llvm/utils/release/final/llvm-project/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:965:3 (node_manager_test+0x155a0b)
[2025-06-26T00:54:16Z]     #1 std::__1::__libcpp_thread_create(unsigned long*, void* (*)(void*), void*) /opt/llvm/bin/../include/c++/v1/__threading_support:509:10 (node_manager_test+0x1d996d)
[2025-06-26T00:54:16Z]     #2 std::__1::thread::thread<ray::raylet::NodeManagerTest_TestDetachedWorkerIsKilledByFailedWorker_Test::TestBody()::$_7, void>(ray::raylet::NodeManagerTest_TestDetachedWorkerIsKilledByFailedWorker_Test::TestBody()::$_7&&) /opt/llvm/bin/../include/c++/v1/thread:307:16 (node_manager_test+0x1d996d)
[2025-06-26T00:54:16Z]     #3 ray::raylet::NodeManagerTest_TestDetachedWorkerIsKilledByFailedWorker_Test::TestBody() /proc/self/cwd/src/ray/raylet/test/node_manager_test.cc:477:15 (node_manager_test+0x1d996d)
[2025-06-26T00:54:16Z]     #4 void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /proc/self/cwd/external/com_google_googletest/googletest/src/gtest.cc:2612:10 (libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so+0xd4a1f)
[2025-06-26T00:54:16Z]     #5 void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /proc/self/cwd/external/com_google_googletest/googletest/src/gtest.cc:2648:14 (libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so+0xd4a1f)
[2025-06-26T00:54:16Z]     #6 testing::Test::Run() /proc/self/cwd/external/com_google_googletest/googletest/src/gtest.cc:2687:5 (libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so+0xd4901)
[2025-06-26T00:54:16Z]     #7 testing::TestInfo::Run() /proc/self/cwd/external/com_google_googletest/googletest/src/gtest.cc:2836:11 (libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so+0xd63c8)
[2025-06-26T00:54:16Z]     #8 testing::TestSuite::Run() /proc/self/cwd/external/com_google_googletest/googletest/src/gtest.cc:3015:30 (libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so+0xd79e4)
[2025-06-26T00:54:16Z]     #9 testing::internal::UnitTestImpl::RunAllTests() /proc/self/cwd/external/com_google_googletest/googletest/src/gtest.cc:5920:44 (libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so+0xecc44)
[2025-06-26T00:54:16Z]     #10 bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /proc/self/cwd/external/com_google_googletest/googletest/src/gtest.cc:2612:10 (libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so+0xec05f)
[2025-06-26T00:54:16Z]     #11 bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /proc/self/cwd/external/com_google_googletest/googletest/src/gtest.cc:2648:14 (libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so+0xec05f)
[2025-06-26T00:54:16Z]     #12 testing::UnitTest::Run() /proc/self/cwd/external/com_google_googletest/googletest/src/gtest.cc:5484:10 (libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so+0xebe4c)
[2025-06-26T00:54:16Z]     #13 RUN_ALL_TESTS() /proc/self/cwd/external/com_google_googletest/googletest/include/gtest/gtest.h:2317:73 (node_manager_test+0x1e0b6e)
[2025-06-26T00:54:16Z]     #14 main /proc/self/cwd/src/ray/raylet/test/node_manager_test.cc:653:10 (node_manager_test+0x1e0b6e)
[2025-06-26T00:54:16Z] 
[2025-06-26T00:54:16Z] SUMMARY: ThreadSanitizer: data race /proc/self/cwd/external/com_google_absl/absl/container/internal/raw_hash_set.h:589:12 in absl::lts_20230802::container_internal::GroupSse2Impl::GroupSse2Impl(absl::lts_20230802::container_internal::ctrl_t const*)

This PR resolves the flaky TSAN failures by:

  1. Moving the execution of publish_worker_failure_callback and publish_node_change_callback onto the io_service_ as well.
  2. Moving the initialization of MockWorker earlier so that the flat_hash_maps used under TaskSpecification::GetSchedulingClass() are initialized first; otherwise, later calls to ResourceSet::Get() could also cause flaky TSAN failures.
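
The first item above can be illustrated with a minimal sketch. It is not the code from this PR: TinyLoop is a hypothetical stand-in for io_service_, and Scheduler, OnWorkerFailure, RunOnLoopAndWait, and PostedCallbackDemo are names invented for illustration. The idea is that state which is not thread-safe is only touched from the loop thread, and the caller uses a std::promise to wait until the posted callback has finished.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <thread>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for an io_context: a handler queue drained by one thread.
class TinyLoop {
 public:
  // Thread-safe: enqueue a handler to be run by the thread that calls Run().
  void Post(std::function<void()> fn) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      queue_.push_back(std::move(fn));
    }
    cv_.notify_one();
  }

  // Runs handlers in FIFO order until Stop() was called and the queue is empty.
  void Run() {
    for (;;) {
      std::function<void()> fn;
      {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return stopped_ || !queue_.empty(); });
        if (queue_.empty()) return;  // stopped and fully drained
        fn = std::move(queue_.front());
        queue_.pop_front();
      }
      fn();  // run the handler outside the lock
    }
  }

  void Stop() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      stopped_ = true;
    }
    cv_.notify_all();
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<std::function<void()>> queue_;
  bool stopped_ = false;
};

// Hypothetical component whose map must only be touched from the loop thread.
struct Scheduler {
  std::unordered_map<int, int> failures_by_class;  // not thread-safe
  void OnWorkerFailure(int scheduling_class) { ++failures_by_class[scheduling_class]; }
};

// Runs `callback` on the loop thread and blocks the caller until it has finished.
void RunOnLoopAndWait(TinyLoop &loop, std::function<void()> callback) {
  auto done = std::make_shared<std::promise<void>>();
  std::future<void> finished = done->get_future();
  loop.Post([done, cb = std::move(callback)] {
    cb();
    done->set_value();
  });
  finished.get();
}

int PostedCallbackDemo() {
  TinyLoop loop;
  Scheduler scheduler;
  std::thread loop_thread([&loop] { loop.Run(); });

  // Both mutations are executed by loop_thread, never by the calling thread.
  loop.Post([&scheduler] { scheduler.OnWorkerFailure(1); });
  RunOnLoopAndWait(loop, [&scheduler] { scheduler.OnWorkerFailure(1); });

  loop.Stop();
  loop_thread.join();
  assert(scheduler.failures_by_class.size() == 1);
  return scheduler.failures_by_class[1];  // safe to read: the loop thread was joined
}
```

Calling PostedCallbackDemo() from a test body or from main exercises the pattern. The promise is held in a shared_ptr in this sketch so that it stays alive until the handler itself is destroyed; that is a choice made here, not something taken from the PR.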

Related issue number

fixes #54096

Checks

I checked with --runs_per_test=100 --config=tsan; both tests pass now.

▶ bazel test --runs_per_test=100 --test_filter=NodeManagerTest.TestDetachedWorkerIsKilledByFailedNode --config=tsan -- //:node_manager_test
INFO: Invocation ID: 42099da3-114a-4b76-893d-ec3786f60c8f
INFO: Analyzed target //:node_manager_test (0 packages loaded, 0 targets configured).
INFO: Found 1 test target...
Target //:node_manager_test up-to-date:
  bazel-bin/node_manager_test
INFO: Elapsed time: 67.889s, Critical Path: 11.98s
INFO: 101 processes: 1 internal, 100 darwin-sandbox.
INFO: Build completed successfully, 101 total actions
//:node_manager_test                                                     PASSED in 6.6s
  Stats over 100 runs: max = 6.6s, min = 3.8s, avg = 4.7s, dev = 0.7s

Executed 1 out of 1 test: 1 test passes.

▶ bazel test --runs_per_test=100 --test_filter=NodeManagerTest.TestDetachedWorkerIsKilledByFailedWorker --config=tsan -- //:node_manager_test
INFO: Invocation ID: ce6f9ea5-7087-4a63-904b-d108c7ecb536
INFO: Analyzed target //:node_manager_test (0 packages loaded, 166 targets configured).
INFO: Found 1 test target...
Target //:node_manager_test up-to-date:
  bazel-bin/node_manager_test
INFO: Elapsed time: 62.382s, Critical Path: 10.12s
INFO: 101 processes: 1 internal, 100 darwin-sandbox.
INFO: Build completed successfully, 101 total actions
//:node_manager_test                                                     PASSED in 5.9s
  Stats over 100 runs: max = 5.9s, min = 3.8s, avg = 4.4s, dev = 0.4s

Executed 1 out of 1 test: 1 test passes.
  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@rueian rueian force-pushed the fix-tsan-tests-54096-2 branch from fcd5e79 to 0b88eae Compare June 26, 2025 07:44
@rueian rueian added the go add ONLY when ready to merge, run all tests label Jun 26, 2025
@rueian rueian marked this pull request as ready for review June 26, 2025 08:03
@Copilot Copilot AI review requested due to automatic review settings June 26, 2025 08:03
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR fixes flaky ThreadSanitizer failures in NodeManagerTest by moving callback invocations onto the io_service_ and ensuring test setup (mock worker spawning) occurs before invoking GCS registration.

  • Moved mock worker process initialization ahead of RegisterGcs() in both tests
  • Wrapped publish_worker_failure_callback and publish_node_change_callback in io_service_.post with std::promise synchronization
Comments suppressed due to low confidence (2)

src/ray/raylet/test/node_manager_test.cc:519

  • [nitpick] The variable name promise is generic and may be ambiguous when multiple promises are in scope; consider using a more descriptive name like worker_failure_promise or node_change_promise.
    std::promise<void> promise;

src/ray/raylet/test/node_manager_test.cc:477

  • The tests spawn external sleep processes; ensure they are properly terminated or reaped after each test to avoid leaving stray processes running on the host.
  auto [proc, spawn_error] =

Comment on lines 476 to 505
const auto worker = std::make_shared<MockWorker>(WorkerID::FromRandom(), 10);
auto [proc, spawn_error] =
Process::Spawn(std::vector<std::string>{"sleep", "1000"}, true);
EXPECT_FALSE(spawn_error);
worker->SetProcess(proc);

Copilot AI Jun 26, 2025

[nitpick] This mock worker setup is duplicated in both tests; consider extracting it into a helper function to reduce code duplication and improve readability.

Suggested change
- const auto worker = std::make_shared<MockWorker>(WorkerID::FromRandom(), 10);
- auto [proc, spawn_error] =
-     Process::Spawn(std::vector<std::string>{"sleep", "1000"}, true);
- EXPECT_FALSE(spawn_error);
- worker->SetProcess(proc);
+ const auto worker = CreateMockWorkerWithProcess();

Contributor Author

The need for Process::Spawn will be removed once I move NodeManager::KillWorker to Worker implementations and mocks.

Contributor

@rueian can you add a TODO with an issue/ticket?

Contributor Author

The PR for removing Process::Spawn is #54068. I am still working on it :)

@rueian
Contributor Author

rueian commented Jun 26, 2025

cc @dayshah @israbbani for review.

@israbbani israbbani self-assigned this Jun 26, 2025
Contributor

@israbbani israbbani left a comment

I think these tests are getting too confusing. We should

  1. Make the relevant gRPC handlers public in the NodeManager so we can delete all the gRPC client logic
  2. Remove usage of the io_service_ on a separate thread and run it on the test thread to avoid data races. (See #53934 (comment))

I plan on doing the same thing in my PR #53934.

Fix flaky TSAN check.

Signed-off-by: Rueian <[email protected]>
Signed-off-by: rueian <[email protected]>
@rueian rueian force-pushed the fix-tsan-tests-54096-2 branch from 0b88eae to 9553032 Compare June 26, 2025 20:28
@rueian
Contributor Author

rueian commented Jun 26, 2025

I think these tests are getting too confusing. We should

  1. Make the relevant gRPC handlers public in the NodeManager so we can delete all the gRPC client logic
  2. Remove usage of the io_service_ on a separate thread and run it on the test thread to avoid data races. (See [core] Fix race condition b/w object eviction & repinning for recovery. #53934 (comment))

I plan on doing the same thing in my PR #53934.

Done! I have rewritten these tests to let io_service_ run on the main thread! I also made HandleRequestWorkerLease public so that we don't need a gRPC client in the tests. PTAL! @israbbani
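
As a rough illustration of that approach (not the actual test code), the sketch below runs the event loop on the test thread, so handlers never execute concurrently with the test body. InlineLoop, LeaseHandler, HandleRequest, and InlineLoopDemo are hypothetical names; InlineLoop only stands in for io_service_, and the handler is called directly instead of through a client.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical single-threaded loop: handlers only run when the owner drains it.
class InlineLoop {
 public:
  void Post(std::function<void()> fn) { queue_.push_back(std::move(fn)); }

  // Runs every queued handler, including handlers posted by other handlers,
  // on the calling thread. Returns the number of handlers executed.
  std::size_t RunUntilIdle() {
    std::size_t ran = 0;
    while (!queue_.empty()) {
      std::function<void()> fn = std::move(queue_.front());
      queue_.pop_front();
      fn();
      ++ran;
    }
    return ran;
  }

 private:
  std::deque<std::function<void()>> queue_;
};

// Hypothetical handler that the test can call directly, without an RPC client.
struct LeaseHandler {
  InlineLoop &loop;
  std::vector<int> granted;

  void HandleRequest(int lease_id, std::function<void()> reply) {
    // The work is deferred to the loop, as an asynchronous handler would do.
    loop.Post([this, lease_id, reply = std::move(reply)] {
      granted.push_back(lease_id);
      reply();
    });
  }
};

int InlineLoopDemo() {
  InlineLoop loop;
  LeaseHandler handler{loop, {}};
  int replies = 0;

  handler.HandleRequest(7, [&replies] { ++replies; });
  assert(replies == 0);  // nothing has run yet: the test thread drives the loop

  const std::size_t ran = loop.RunUntilIdle();
  assert(ran == 1);
  assert(handler.granted.size() == 1 && handler.granted[0] == 7);
  (void)ran;  // keeps builds with NDEBUG free of unused-variable warnings
  return replies;
}
```

Because only one thread ever touches the handler state in this arrangement, there is nothing for ThreadSanitizer to flag, and the test decides exactly when queued work runs.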

Contributor

@israbbani israbbani left a comment

Looks reasonable to me. I triggered TSAN in premerge. @edoakes plz merge if premerge passes.

// The leased worker should not be killed by this because it is a detached actor.
rpc::WorkerDeltaData delta_data;
delta_data.set_worker_id(owner_worker_id.Binary());
publish_worker_failure_callback(std::move(delta_data));
// Wait for more than kill_worker_timeout_milliseconds.
std::this_thread::sleep_for(std::chrono::seconds(1));
Contributor

In your follow-up PR to clean up the test, can we remove this sleep_for? We could try setting kill_worker_timeout_milliseconds to make this synchronous or think of another way to do this. Sleeps in unit tests have a tendency to make them flaky.

Contributor Author

Totally agree! I already removed it in #54068, but I am still working on it.
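
One possible way to drop the sleep, sketched under loud assumptions: this is not what #54068 does, and ManualTimers, FakeWorker, OnOwnerFailed, and DetachedWorkerSurvives are hypothetical names. The test owns a fake clock and advances it past the kill timeout explicitly, so the "wait for more than the timeout" step becomes deterministic and takes no wall-clock time.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <map>
#include <utility>

// Hypothetical timer queue driven by a fake clock that only the test advances.
class ManualTimers {
 public:
  using Ms = std::chrono::milliseconds;

  void RunAfter(Ms delay, std::function<void()> fn) {
    pending_.emplace(now_ + delay, std::move(fn));
  }

  // Moves the fake clock forward and fires every timer that is now due.
  void Advance(Ms by) {
    now_ += by;
    while (!pending_.empty() && pending_.begin()->first <= now_) {
      std::function<void()> fn = std::move(pending_.begin()->second);
      pending_.erase(pending_.begin());
      fn();
    }
  }

 private:
  Ms now_{0};
  std::multimap<Ms, std::function<void()>> pending_;
};

struct FakeWorker {
  bool detached = false;
  bool killed = false;
};

// Hypothetical kill path: a non-detached worker is killed once the timeout
// elapses, while a detached worker is left alone.
void OnOwnerFailed(ManualTimers &timers, FakeWorker &worker, ManualTimers::Ms kill_timeout) {
  if (worker.detached) return;
  timers.RunAfter(kill_timeout, [&worker] { worker.killed = true; });
}

bool DetachedWorkerSurvives() {
  ManualTimers timers;
  FakeWorker detached;
  detached.detached = true;
  FakeWorker regular;
  const ManualTimers::Ms kill_timeout(100);

  OnOwnerFailed(timers, detached, kill_timeout);
  OnOwnerFailed(timers, regular, kill_timeout);

  timers.Advance(kill_timeout - ManualTimers::Ms(1));
  assert(!regular.killed);  // the timeout has not elapsed yet

  timers.Advance(ManualTimers::Ms(1));  // now exactly at the timeout
  return regular.killed && !detached.killed;
}
```

The regular worker is included so the check is not vacuous: the same clock advance that leaves the detached worker alive does kill a worker that should be killed.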

@edoakes edoakes enabled auto-merge (squash) June 26, 2025 22:57
@github-actions github-actions bot disabled auto-merge June 27, 2025 00:21
@edoakes edoakes merged commit 2a1fe48 into ray-project:master Jun 27, 2025
5 checks passed
elliot-barn pushed a commit that referenced this pull request Jul 2, 2025
Signed-off-by: Rueian <[email protected]>
Co-authored-by: Ibrahim Rabbani <[email protected]>
Signed-off-by: elliot-barn <[email protected]>
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[core] TSAN failing on node_manager_test
3 participants