-
Notifications
You must be signed in to change notification settings - Fork 6.8k
[core][test] fix flaky data races in NodeManagerTest #54129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
fcd5e79
to
0b88eae
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR fixes flaky ThreadSanitizer failures in NodeManagerTest
by moving callback invocations onto the io_service_
and ensuring test setup (mock worker spawning) occurs before invoking GCS registration.
- Moved mock worker process initialization ahead of
RegisterGcs()
in both tests - Wrapped
publish_worker_failure_callback
andpublish_node_change_callback
inio_service_.post
withstd::promise
synchronization
Comments suppressed due to low confidence (2)
src/ray/raylet/test/node_manager_test.cc:519
- [nitpick] The variable name
promise
is generic and may be ambiguous when multiple promises are in scope; consider using a more descriptive name likeworker_failure_promise
ornode_change_promise
.
std::promise<void> promise;
src/ray/raylet/test/node_manager_test.cc:477
- The tests spawn external
sleep
processes; ensure they are properly terminated or reaped after each test to avoid leaving stray processes running on the host.
auto [proc, spawn_error] =
const auto worker = std::make_shared<MockWorker>(WorkerID::FromRandom(), 10); | ||
auto [proc, spawn_error] = | ||
Process::Spawn(std::vector<std::string>{"sleep", "1000"}, true); | ||
EXPECT_FALSE(spawn_error); | ||
worker->SetProcess(proc); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] This mock worker setup is duplicated in both tests; consider extracting it into a helper function to reduce code duplication and improve readability.
const auto worker = std::make_shared<MockWorker>(WorkerID::FromRandom(), 10); | |
auto [proc, spawn_error] = | |
Process::Spawn(std::vector<std::string>{"sleep", "1000"}, true); | |
EXPECT_FALSE(spawn_error); | |
worker->SetProcess(proc); | |
const auto worker = CreateMockWorkerWithProcess(); |
Copilot uses AI. Check for mistakes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The need for Process::Spawn
will be removed once I move NodeManager::KillWorker
to Worker
implementations and mocks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rueian can you add a TODO with an issue/ticket?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PR for removing Process:Spwan
is here #54068. I am still working on it :)
cc @dayshah @israbbani for review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these tests are getting too confusing. We should
- Make the relevant gRPC handlers public in the NodeManager so we can delete all the gRPC client logic
- Remove usage of the io_service_ on a separate thread and run it on the test thread to avoid data races. (See #53934 (comment))
I plan on doing the same thing in my PR #53934.
const auto worker = std::make_shared<MockWorker>(WorkerID::FromRandom(), 10); | ||
auto [proc, spawn_error] = | ||
Process::Spawn(std::vector<std::string>{"sleep", "1000"}, true); | ||
EXPECT_FALSE(spawn_error); | ||
worker->SetProcess(proc); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rueian can you add a TODO with an issue/ticket?
Fix flaky TSAN check. Signed-off-by: Rueian <[email protected]> Signed-off-by: rueian <[email protected]>
0b88eae
to
9553032
Compare
Done! I have rewritten these tests to let io_service_ run on the main thread! Also made |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks reasonable to me. I triggered TSAN in premerge. @edoakes plz merge if premerge passes.
// The leased worker should not be killed by this because it is a detached actor. | ||
rpc::WorkerDeltaData delta_data; | ||
delta_data.set_worker_id(owner_worker_id.Binary()); | ||
publish_worker_failure_callback(std::move(delta_data)); | ||
// Wait for more than kill_worker_timeout_milliseconds. | ||
std::this_thread::sleep_for(std::chrono::seconds(1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In your follow up PR to clean up the test, can we remove this sleep_for
? We could try setting kill_worker_timeout_milliseconds
to make this synchronous or think of another way to do this. Sleeps in unit tests have a tendency to make them flaky.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally agree! I already removed it in #54068, but I am still working on it.
Signed-off-by: Rueian <[email protected]> Signed-off-by: Rueian <[email protected]>
## Why are these changes needed? I found that the TSAN check for NodeManagerTest becomes flaky after #54097. Refer to the history below; the first attempt was unsuccessful. --------- Signed-off-by: Rueian <[email protected]> Co-authored-by: Ibrahim Rabbani <[email protected]> Signed-off-by: elliot-barn <[email protected]>
Why are these changes needed?
I found that the TSAN check for NodeManagerTest becomes flaky after #54097. Refer to the history below; the first attempt was unsuccessful.
NodeManagerTest
can still fail the TSAN check sometimes at the following code path, which is different from the one mentioned at #54097 (comment):This PR resolves the flaky TSAN failures by
publish_worker_failure_callback
andpublish_node_change_callback
to theio_service_
as well.MockWorker
forward to initialize the flat_hash_maps under theTaskSpecification::GetSchedulingClass()
first, otherwise it could also cause flaky TSAN failures with laterResourceSet::Get()
.Related issue number
fixes #54096
Checks
I checked with
--runs_per_test=100 --config=tsan
, they all pass now.git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.