[core] Improve status messages and add comments about stale seq_no handling #54470

edoakes · 2025-07-09T17:24:59Z

I had a hard time piecing together the logic around handling tasks with stale and canceled sequence numbers.

Added some comments and updated the status messages to improve readability. Specifically, this message was misleading: in this branch, the server is deciding to cancel the stale messages, not the client.

Also standardized the naming to "dependencies resolved" (previously a mix of "can execute", "satisfied", "resolved").

Signed-off-by: Edward Oakes <[email protected]>

gemini-code-assist

Summary of Changes

Hello @edoakes, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the readability and accuracy of the ActorSchedulingQueue's internal logic, particularly concerning how it manages and cancels actor tasks based on their sequence numbers. The changes clarify the conditions for task cancellation, streamline the timeout mechanism for pending tasks, and update associated log messages to provide more precise diagnostic information, making the system's behavior easier to understand and debug.

Highlights

Improved Clarity for Stale Sequence Number Handling: Added detailed comments to ActorSchedulingQueue::ScheduleRequests to explain the conditions under which stale RPCs are canceled. This specifically clarifies that cancellation occurs when the client has processed up to a higher sequence number, indicating the task is no longer needed (e.g., it has been retried).
Refactored and Clarified Timeout Logic: The logic previously encapsulated in the OnSequencingWaitTimeout method has been inlined directly into the async_wait callback within ScheduleRequests. New, comprehensive comments were added to explain the purpose of this timeout, which is to handle scenarios where an expected earlier sequence number task is not received, potentially due to client misbehavior or crashes.
Enhanced Status Messages for Task Cancellation: Updated the cancellation messages to be more precise and informative. The message for client-discarded tasks now explicitly states, 'Task cancelled due to stale sequence number. The client intentionally discarded this task.' Similarly, the message for server-side timeouts now clearly indicates, 'Server timed out while waiting for an earlier seq_no,' correcting previous misleading messages.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in issue comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

edoakes · 2025-07-09T17:25:47Z

src/ray/core_worker/transport/actor_scheduling_queue.cc

    });
  }
 }

-/// Called when we time out waiting for an earlier task to show up.
-void ActorSchedulingQueue::OnSequencingWaitTimeout() {


I inlined this because it improved readability, but find to keep it separate if others prefer

i prefer the inline

gemini-code-assist

Code Review

The pull request improves the clarity of status messages and adds comments to the ActorSchedulingQueue to better explain the handling of stale sequence numbers. The changes enhance the readability and maintainability of the code, particularly around task cancellation and timeout scenarios.

src/ray/core_worker/transport/actor_scheduling_queue.cc

edoakes · 2025-07-09T17:28:30Z

src/ray/core_worker/transport/actor_scheduling_queue.cc

  if (pending_actor_tasks_.empty() ||
      !pending_actor_tasks_.begin()->second.CanExecute()) {
-    // No timeout for object dependency waits.
+    // Either there are no tasks to execute, or the head of line is blocked waiting for
+    // its dependencies. We do not set a timeout waiting for the dependencies.


There is actually a potentially interesting edge case here:

It could be that the head-of-line task is blocking waiting for dependencies and it has an out-of-order sequence number. in that case, we would hang forever without the timeout applying. I'm not sure if this can practically happen, but it might be worth reordering the checks here.

@dayshah check my reasoning here. It should be relatively straightforward to fix.

It's dependencies should eventually be fetched though right. And then we'll start the 30 second timer. Takes a little longer bc the 30s timer starts after the fetch but still correct.

The dependency fetch shouldn't hang for any other case than the one my retry pr is handling??? hopefully?

If the obj was totally lost, I'm hoping the dependency waiter still unblocks the task eventually, otherwise that's a problem everywhere. Or are you talking about some other case.

Yes it should get unblocked eventually...

Signed-off-by: Edward Oakes <[email protected]>

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Edward Oakes <[email protected]>

Signed-off-by: Edward Oakes <[email protected]>

dayshah

i think an extra comment somewhere stating that the case for both of these uses is for the case where client failed to send the req. In the wait for 30s case, the server decided that on its own. In the eager cancel case, the client told the server that.

This ^ still isn't clear enough from the comments imo. This would maybe be overfitting the comments a bit to one possible failure case but still the logic is complicated enough to warrant that here imo.

I'm not sure what the right place to put that is though. Maybe even in the header or .proto?

…es/seq-no-comment

Signed-off-by: Edward Oakes <[email protected]>

edoakes · 2025-07-10T15:02:35Z

I'm not sure what the right place to put that is though. Maybe even in the header or .proto?

It is documented reasonably well here:

ray/src/ray/protobuf/core_worker.proto

Line 88 in 2559159

// The maximum sequence number for which the client has processed responses.

I've added clarifying comments about the server behavior.

WIP

c2eebd7

Signed-off-by: Edward Oakes <[email protected]>

edoakes added the go add ONLY when ready to merge, run all tests label Jul 9, 2025

edoakes requested review from israbbani and dayshah July 9, 2025 17:25

gemini-code-assist bot reviewed Jul 9, 2025

View reviewed changes

edoakes commented Jul 9, 2025

View reviewed changes

gemini-code-assist bot reviewed Jul 9, 2025

View reviewed changes

src/ray/core_worker/transport/actor_scheduling_queue.cc Outdated Show resolved Hide resolved

src/ray/core_worker/transport/actor_scheduling_queue.cc Outdated Show resolved Hide resolved

src/ray/core_worker/transport/actor_scheduling_queue.cc Outdated Show resolved Hide resolved

edoakes commented Jul 9, 2025

View reviewed changes

edoakes and others added 3 commits July 9, 2025 12:29

WIP

3867ca0

Signed-off-by: Edward Oakes <[email protected]>

Update src/ray/core_worker/transport/actor_scheduling_queue.cc

2b7885e

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Edward Oakes <[email protected]>

fix

91ee2b1

Signed-off-by: Edward Oakes <[email protected]>

cszhu added community-contribution Contributed by the community and removed community-contribution Contributed by the community labels Jul 9, 2025

fix

f075074

Signed-off-by: Edward Oakes <[email protected]>

dayshah approved these changes Jul 9, 2025

View reviewed changes

edoakes added 2 commits July 10, 2025 09:58

Merge branch 'master' of https://github.com/ray-project/ray into eoak…

594a737

…es/seq-no-comment

fix

aeea870

Signed-off-by: Edward Oakes <[email protected]>

edoakes enabled auto-merge (squash) July 10, 2025 15:02

edoakes disabled auto-merge July 10, 2025 15:02

edoakes enabled auto-merge (squash) July 10, 2025 15:02

edoakes merged commit 69e0e36 into ray-project:master Jul 10, 2025
7 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[core] Improve status messages and add comments about stale seq_no handling #54470

[core] Improve status messages and add comments about stale seq_no handling #54470

Uh oh!

edoakes commented Jul 9, 2025 •

edited

Loading

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

edoakes Jul 9, 2025

Uh oh!

dayshah Jul 9, 2025

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

edoakes Jul 9, 2025

Uh oh!

dayshah Jul 9, 2025

Uh oh!

edoakes Jul 10, 2025

Uh oh!

dayshah left a comment •

edited

Loading

Uh oh!

edoakes commented Jul 10, 2025

Uh oh!

Uh oh!

Uh oh!

[core] Improve status messages and add comments about stale seq_no handling #54470

[core] Improve status messages and add comments about stale seq_no handling #54470

Uh oh!

Conversation

edoakes commented Jul 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Summary of Changes

Highlights

Footnotes

Uh oh!

edoakes Jul 9, 2025

Choose a reason for hiding this comment

Uh oh!

dayshah Jul 9, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

edoakes Jul 9, 2025

Choose a reason for hiding this comment

Uh oh!

dayshah Jul 9, 2025

Choose a reason for hiding this comment

Uh oh!

edoakes Jul 10, 2025

Choose a reason for hiding this comment

Uh oh!

dayshah left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

edoakes commented Jul 10, 2025

Uh oh!

Uh oh!

Uh oh!

edoakes commented Jul 9, 2025 •

edited

Loading

dayshah left a comment •

edited

Loading