
[core] Fix check fail when task buffer periodical runner runs before RayEvent is initialized #55249


Merged
merged 3 commits into from
Aug 5, 2025

Conversation

dayshah
Contributor

@dayshah dayshah commented Aug 5, 2025

Why are these changes needed?

It's possible for this check to fail if the task buffer periodical runner starts before RayEvent is initialized.

ray/src/ray/util/event.cc

Lines 221 to 224 in f678778

RAY_LOG(FATAL)
    << "RayEventInit wasn't called with the necessary source type "
    << ExportEvent_SourceType_Name(export_event.source_type())
    << ". This indicates a bug in the code, and the event will be dropped.";

export_log_reporter_map_ is populated in RayEventInit. On the core worker, RayEventInit is called after the actual core worker construction, which means it also runs after the task buffer is initialized and started. That creates a race: the task buffer starts its thread and periodical runner and tries to flush, but RayEvent hasn't been initialized yet. The fix moves RayEvent initialization to before CreateCoreWorker.

This also makes some minor C++ cleanup changes in related code.
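The ordering problem described above can be sketched in isolation. This is a minimal illustration, not Ray's code: ReporterRegistry, PeriodicFlusher, and the two helper functions are hypothetical stand-ins for the reporter map, the task event buffer, and the two initialization orders. Where the real code hits RAY_LOG(FATAL), the sketch only counts the failed flush.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical stand-in for the state that RayEventInit populates.
class ReporterRegistry {
 public:
  void Init() { initialized_.store(true); }
  bool IsInitialized() const { return initialized_.load(); }

 private:
  std::atomic<bool> initialized_{false};
};

// Hypothetical stand-in for the task event buffer: Start() launches a
// background thread that flushes periodically.
class PeriodicFlusher {
 public:
  explicit PeriodicFlusher(const ReporterRegistry &registry) : registry_(registry) {}
  ~PeriodicFlusher() { Stop(); }

  void Start() {
    thread_ = std::thread([this] {
      while (!stop_.load()) {
        Flush();
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
      }
    });
  }

  void Stop() {
    stop_.store(true);
    if (thread_.joinable()) thread_.join();
  }

  int failed_flushes() const { return failed_flushes_.load(); }
  int total_flushes() const { return total_flushes_.load(); }

 private:
  void Flush() {
    // A flush that finds the registry uninitialized is counted as a failure.
    if (!registry_.IsInitialized()) failed_flushes_.fetch_add(1);
    total_flushes_.fetch_add(1);
  }

  const ReporterRegistry &registry_;
  std::atomic<bool> stop_{false};
  std::atomic<int> failed_flushes_{0};
  std::atomic<int> total_flushes_{0};
  std::thread thread_;
};

// Racy order: the flusher runs at least once before the registry is ready.
int FailedFlushesWhenStartedFirst() {
  ReporterRegistry registry;
  PeriodicFlusher flusher(registry);
  flusher.Start();
  // Simulate slow construction: wait until the runner has flushed once.
  while (flusher.total_flushes() == 0) std::this_thread::yield();
  registry.Init();
  flusher.Stop();
  return flusher.failed_flushes();
}

// Fixed order: the registry is ready before any background thread exists.
int FailedFlushesWhenInitFirst() {
  ReporterRegistry registry;
  registry.Init();
  PeriodicFlusher flusher(registry);
  flusher.Start();
  while (flusher.total_flushes() == 0) std::this_thread::yield();
  flusher.Stop();
  return flusher.failed_flushes();
}
```

In the sketch the racy order is forced deterministically by waiting for the first flush; in a real process the same interleaving only happens occasionally, which is why this kind of check failure tends to show up as a flaky crash.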

Core worker creation -> RayEventInit

    auto worker = CreateCoreWorker(options_, worker_id_);
    auto write_locked = core_worker_.LockForWrite();
    write_locked.Get() = worker;
  }
  // Initialize event framework.
  if (RayConfig::instance().event_log_reporter_enabled() && !options_.log_dir.empty()) {
    const std::vector<SourceTypeVariant> source_types = {
        ray::rpc::Event_SourceType::Event_SourceType_CORE_WORKER,
        ray::rpc::ExportEvent_SourceType::ExportEvent_SourceType_EXPORT_TASK};
    RayEventInit(source_types,
                 absl::flat_hash_map<std::string, std::string>(),
                 options_.log_dir,
                 RayConfig::instance().event_level(),
                 RayConfig::instance().emit_event_to_log_file());

Task buffer start in CreateCoreWorker

if (!task_event_buffer->Start().ok()) {

The periodical runner is started in Start, with the thread startup right before it:
https://github.com/ray-project/ray/blob/master/src/ray/core_worker/task_event_buffer.cc#L442

@dayshah dayshah requested a review from a team as a code owner August 5, 2025 12:04
@dayshah dayshah added the go add ONLY when ready to merge, run all tests label Aug 5, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @dayshah, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

I've addressed a race condition in the core worker initialization process where the task buffer's periodical runner could attempt to flush events before the RayEvent system was fully initialized. This could lead to a check failure. My fix reorders the initialization steps to ensure RayEvent is ready before the task buffer starts its operations. Additionally, I've included several minor C++ code cleanups and modernizations to related event utility files.

Highlights

  • Race Condition Fix: Resolved a potential check failure where the task buffer's periodical runner could run before RayEvent was initialized, by reordering the initialization sequence.
  • Initialization Order: Moved the RayEvent initialization call to occur earlier in the CoreWorkerProcessImpl constructor, specifically before the core worker object is fully constructed and the task buffer is started.
  • C++ Code Modernization: Applied various C++ cleanup and modernization changes across the event utility files, including using std::call_once, passing parameters by const reference, and explicit default constructors/destructors.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly addresses a race condition by reordering the initialization of RayEvent to occur before the task buffer is started. The logic for the fix is sound. Additionally, the PR includes numerous C++ modernizations and cleanups, such as adopting std::call_once, using the pass-by-value-and-move idiom, and adhering to the Rule of Five for several classes, which significantly improve code quality and maintainability. I have one suggestion to ensure build stability by including a missing header.

Collaborator

@can-anyscale can-anyscale left a comment


nice catch; i think the event framework is also init inside raylet, gcs as well - do those have the same issues?

source_types, custom_fields, log_dir, event_level, emit_event_to_log_file);
});
static std::once_flag init_once_;
std::call_once(init_once_, [&]() {
Collaborator


qq: what's the init_once_ for? does it crash or just ignore the call if this function is called twice?

Contributor Author


it just ignores the second call if called twice
https://en.cppreference.com/w/cpp/thread/call_once.html
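A small sketch of that std::call_once behavior, assuming a hypothetical init function (InitEventLogDir, InitRuns, and the file-scope variables are illustrative names, not Ray's RayEventInit): the callable runs on the first call only, and later calls return without running it, so their arguments are silently ignored.

```cpp
#include <mutex>
#include <string>

namespace {
std::once_flag init_once;    // guards the one-time setup below
std::string active_log_dir;  // written only inside the call_once callable
int init_runs = 0;           // how many times the callable actually ran
}  // namespace

// Hypothetical init function: only the first call's log_dir takes effect.
const std::string &InitEventLogDir(const std::string &log_dir) {
  std::call_once(init_once, [&]() {
    active_log_dir = log_dir;
    ++init_runs;
  });
  // Safe to read here: every call_once on this flag returns only after the
  // single successful execution of the callable has completed.
  return active_log_dir;
}

int InitRuns() { return init_runs; }
```

Calling InitEventLogDir("/tmp/a") and then InitEventLogDir("/tmp/b") leaves the directory at "/tmp/a" and InitRuns() at 1. One caveat from the standard: if the callable exits by throwing, the flag is not set and a later call will try again.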

@dayshah
Contributor Author

dayshah commented Aug 5, 2025

nice catch; i think the event framework is also init inside raylet, gcs as well - do those have the same issues?

I think in those we call RayEventInit first, before running anything

@dayshah dayshah merged commit f9d1ff1 into ray-project:master Aug 5, 2025
5 checks passed
@dayshah dayshah deleted the task-buffer-check branch August 5, 2025 22:06