Skip to content

[core] Improve status messages and add comments about stale seq_no handling #54470

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Jul 10, 2025

Conversation

edoakes
Copy link
Collaborator

@edoakes edoakes commented Jul 9, 2025

I had a hard time piecing together the logic around handling tasks with stale and canceled sequence numbers.

Added some comments and updated the status messages to improve readability. Specifically, this message was misleading: in this branch, the server is deciding to cancel the stale messages, not the client.

Also standardized the naming to "dependencies resolved" (previously a mix of "can execute", "satisfied", "resolved").

Signed-off-by: Edward Oakes <[email protected]>
@edoakes edoakes added the go add ONLY when ready to merge, run all tests label Jul 9, 2025
@edoakes edoakes requested review from israbbani and dayshah July 9, 2025 17:25
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary of Changes

Hello @edoakes, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the readability and accuracy of the ActorSchedulingQueue's internal logic, particularly concerning how it manages and cancels actor tasks based on their sequence numbers. The changes clarify the conditions for task cancellation, streamline the timeout mechanism for pending tasks, and update associated log messages to provide more precise diagnostic information, making the system's behavior easier to understand and debug.

Highlights

  • Improved Clarity for Stale Sequence Number Handling: Added detailed comments to ActorSchedulingQueue::ScheduleRequests to explain the conditions under which stale RPCs are canceled. This specifically clarifies that cancellation occurs when the client has processed up to a higher sequence number, indicating the task is no longer needed (e.g., it has been retried).
  • Refactored and Clarified Timeout Logic: The logic previously encapsulated in the OnSequencingWaitTimeout method has been inlined directly into the async_wait callback within ScheduleRequests. New, comprehensive comments were added to explain the purpose of this timeout, which is to handle scenarios where an expected earlier sequence number task is not received, potentially due to client misbehavior or crashes.
  • Enhanced Status Messages for Task Cancellation: Updated the cancellation messages to be more precise and informative. The message for client-discarded tasks now explicitly states, 'Task cancelled due to stale sequence number. The client intentionally discarded this task.' Similarly, the message for server-side timeouts now clearly indicates, 'Server timed out while waiting for an earlier seq_no,' correcting previous misleading messages.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in issue comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

});
}
}

/// Called when we time out waiting for an earlier task to show up.
void ActorSchedulingQueue::OnSequencingWaitTimeout() {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I inlined this because it improved readability, but find to keep it separate if others prefer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i prefer the inline

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The pull request improves the clarity of status messages and adds comments to the ActorSchedulingQueue to better explain the handling of stale sequence numbers. The changes enhance the readability and maintainability of the code, particularly around task cancellation and timeout scenarios.

Comment on lines 174 to 177
if (pending_actor_tasks_.empty() ||
!pending_actor_tasks_.begin()->second.CanExecute()) {
// No timeout for object dependency waits.
// Either there are no tasks to execute, or the head of line is blocked waiting for
// its dependencies. We do not set a timeout waiting for the dependencies.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is actually a potentially interesting edge case here:

It could be that the head-of-line task is blocking waiting for dependencies and it has an out-of-order sequence number. in that case, we would hang forever without the timeout applying. I'm not sure if this can practically happen, but it might be worth reordering the checks here.

@dayshah check my reasoning here. It should be relatively straightforward to fix.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's dependencies should eventually be fetched though right. And then we'll start the 30 second timer. Takes a little longer bc the 30s timer starts after the fetch but still correct.

The dependency fetch shouldn't hang for any other case than the one my retry pr is handling??? hopefully?

If the obj was totally lost, I'm hoping the dependency waiter still unblocks the task eventually, otherwise that's a problem everywhere. Or are you talking about some other case.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it should get unblocked eventually...

edoakes and others added 3 commits July 9, 2025 12:29
Signed-off-by: Edward Oakes <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
@cszhu cszhu added community-contribution Contributed by the community and removed community-contribution Contributed by the community labels Jul 9, 2025
Signed-off-by: Edward Oakes <[email protected]>
Copy link
Contributor

@dayshah dayshah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think an extra comment somewhere stating that the case for both of these uses is for the case where client failed to send the req. In the wait for 30s case, the server decided that on its own. In the eager cancel case, the client told the server that.

This ^ still isn't clear enough from the comments imo. This would maybe be overfitting the comments a bit to one possible failure case but still the logic is complicated enough to warrant that here imo.

I'm not sure what the right place to put that is though. Maybe even in the header or .proto?

@edoakes
Copy link
Collaborator Author

edoakes commented Jul 10, 2025

I'm not sure what the right place to put that is though. Maybe even in the header or .proto?

It is documented reasonably well here:

// The maximum sequence number for which the client has processed responses.

I've added clarifying comments about the server behavior.

@edoakes edoakes enabled auto-merge (squash) July 10, 2025 15:02
@edoakes edoakes disabled auto-merge July 10, 2025 15:02
@edoakes edoakes enabled auto-merge (squash) July 10, 2025 15:02
@edoakes edoakes merged commit 69e0e36 into ray-project:master Jul 10, 2025
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants