-
Notifications
You must be signed in to change notification settings - Fork 6.7k
[core] Improve status messages and add comments about stale seq_no handling #54470
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Edward Oakes <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Summary of Changes
Hello @edoakes, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the readability and accuracy of the ActorSchedulingQueue
's internal logic, particularly concerning how it manages and cancels actor tasks based on their sequence numbers. The changes clarify the conditions for task cancellation, streamline the timeout mechanism for pending tasks, and update associated log messages to provide more precise diagnostic information, making the system's behavior easier to understand and debug.
Highlights
- Improved Clarity for Stale Sequence Number Handling: Added detailed comments to
ActorSchedulingQueue::ScheduleRequests
to explain the conditions under which stale RPCs are canceled. This specifically clarifies that cancellation occurs when the client has processed up to a higher sequence number, indicating the task is no longer needed (e.g., it has been retried). - Refactored and Clarified Timeout Logic: The logic previously encapsulated in the
OnSequencingWaitTimeout
method has been inlined directly into theasync_wait
callback withinScheduleRequests
. New, comprehensive comments were added to explain the purpose of this timeout, which is to handle scenarios where an expected earlier sequence number task is not received, potentially due to client misbehavior or crashes. - Enhanced Status Messages for Task Cancellation: Updated the cancellation messages to be more precise and informative. The message for client-discarded tasks now explicitly states, 'Task cancelled due to stale sequence number. The client intentionally discarded this task.' Similarly, the message for server-side timeouts now clearly indicates, 'Server timed out while waiting for an earlier seq_no,' correcting previous misleading messages.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command>
or @gemini-code-assist <command>
. Below is a summary of the supported commands.
Feature | Command | Description |
---|---|---|
Code Review | /gemini review |
Performs a code review for the current pull request in its current state. |
Pull Request Summary | /gemini summary |
Provides a summary of the current pull request in its current state. |
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
Help | /gemini help |
Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/
folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
-
Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
}); | ||
} | ||
} | ||
|
||
/// Called when we time out waiting for an earlier task to show up. | ||
void ActorSchedulingQueue::OnSequencingWaitTimeout() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I inlined this because it improved readability, but find to keep it separate if others prefer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i prefer the inline
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
The pull request improves the clarity of status messages and adds comments to the ActorSchedulingQueue
to better explain the handling of stale sequence numbers. The changes enhance the readability and maintainability of the code, particularly around task cancellation and timeout scenarios.
if (pending_actor_tasks_.empty() || | ||
!pending_actor_tasks_.begin()->second.CanExecute()) { | ||
// No timeout for object dependency waits. | ||
// Either there are no tasks to execute, or the head of line is blocked waiting for | ||
// its dependencies. We do not set a timeout waiting for the dependencies. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is actually a potentially interesting edge case here:
It could be that the head-of-line task is blocking waiting for dependencies and it has an out-of-order sequence number. in that case, we would hang forever without the timeout applying. I'm not sure if this can practically happen, but it might be worth reordering the checks here.
@dayshah check my reasoning here. It should be relatively straightforward to fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's dependencies should eventually be fetched though right. And then we'll start the 30 second timer. Takes a little longer bc the 30s timer starts after the fetch but still correct.
The dependency fetch shouldn't hang for any other case than the one my retry pr is handling??? hopefully?
If the obj was totally lost, I'm hoping the dependency waiter still unblocks the task eventually, otherwise that's a problem everywhere. Or are you talking about some other case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it should get unblocked eventually...
Signed-off-by: Edward Oakes <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think an extra comment somewhere stating that the case for both of these uses is for the case where client failed to send the req. In the wait for 30s case, the server decided that on its own. In the eager cancel case, the client told the server that.
This ^ still isn't clear enough from the comments imo. This would maybe be overfitting the comments a bit to one possible failure case but still the logic is complicated enough to warrant that here imo.
I'm not sure what the right place to put that is though. Maybe even in the header or .proto?
…es/seq-no-comment
Signed-off-by: Edward Oakes <[email protected]>
It is documented reasonably well here: ray/src/ray/protobuf/core_worker.proto Line 88 in 2559159
I've added clarifying comments about the server behavior. |
I had a hard time piecing together the logic around handling tasks with stale and canceled sequence numbers.
Added some comments and updated the status messages to improve readability. Specifically, this message was misleading: in this branch, the server is deciding to cancel the stale messages, not the client.
Also standardized the naming to "dependencies resolved" (previously a mix of "can execute", "satisfied", "resolved").