-
Notifications
You must be signed in to change notification settings - Fork 485
feat: Add trial_timeout parameter to Katib tune API #2568
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
feat: Add trial_timeout parameter to Katib tune API #2568
Conversation
- Add optional trial_timeout parameter to tune() function - Support timeout for both Job and PyTorchJob trial types - Set active_deadline_seconds on Job spec for Job-based trials - Set active_deadline_seconds on RunPolicy for PyTorchJob-based trials - Add comprehensive documentation and usage examples - Add test cases for both Job and PyTorchJob timeout scenarios - Maintain backward compatibility with existing code This feature prevents individual trials from running indefinitely and consuming cluster resources by allowing users to specify per-trial timeouts in seconds. Signed-off-by: wassimbensalem <[email protected]>
3b08fb7
to
d13028a
Compare
/assign |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this great contribution @wassimbensalem!
packages_to_install: List[str] = None, | ||
pip_index_url: str = "https://pypi.org/simple", | ||
metrics_collector_config: Dict[str, Any] = {"kind": "StdOut"}, | ||
trial_timeout: Optional[int] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like trial_timeout
parameter, but I was wondering whether we need to be aligned with k8s API here or not ?
For example, we can call it: trial_active_deadline_seconds
cc @kubeflow/kubeflow-sdk-team
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure! I was just thinking the trial_timeout is easier for the users to understand. But we can definetyl align with k8s API namings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like trial_timeout
too, any thoughts @kubeflow/kubeflow-sdk-team ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually "timeout" is for duration while "deadline" is for a fixed point in time.
An example is from the Golang context API: https://pkg.go.dev/context#WithTimeout
It might be Kubernetes APIs have a "timeout after start" semantic which might explain why deadline is used.
So if we want to provide the ability to pass a duration, it seems trial_timeout
would be more appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be Kubernetes APIs have a "timeout after start" semantic which might explain why deadline is used.
But don't we indicate "timeout after Trial start" with that property ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, assuming that interpretation of the Kubernetes naming choice is correct, tune(trial_active_deadline_seconds=...)
may be more correct if we consider the duration applies after the trial actually starts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wassimbensalem Does this name sound good ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
trial_active_deadline_seconds
looks good to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andreyvelich sure, I will update the changes asap!
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
- Fix line length and formatting in katib_client_test.py - Fix line length and formatting in utils.py - All changes made by black formatter Signed-off-by: wassimbensalem <[email protected]>
- Updated parameter name in function signatures and documentation - Updated test cases and assertions to use new parameter name - Fixed line length issues to meet linting requirements - Maintains backward compatibility with Kubernetes activeDeadlineSeconds field Signed-off-by: wassimbensalem <[email protected]>
5267a91
to
69a34bc
Compare
Problem
Katib trials can run indefinitely, consuming unlimited cluster resources and potentially causing resource exhaustion.
The current timeout supports only per Experiment and not per trial deadline.
Solution
Add
trial_timeout
parameter to thetune()
API to automatically terminate trials after a specified duration. If nothing provided no timeout will be applied.Changes
trial_timeout: Optional[int] = None
intune()
functionactive_deadline_seconds
on Kubernetes Job specsactive_deadline_seconds
on PyTorchJob RunPolicyUsage
Benefits