@wassimbensalem wassimbensalem commented Sep 4, 2025

Problem

Katib trials can run indefinitely, consuming cluster resources without bound and potentially causing resource exhaustion.
The existing timeout applies only to the Experiment as a whole; there is no per-trial deadline.

Solution

Add a trial_timeout parameter to the tune() API to automatically terminate trials after a specified duration in seconds. If it is not provided, no timeout is applied.

Changes

  • New Parameter: trial_timeout: Optional[int] = None in tune() function
  • Job Support: Sets active_deadline_seconds on Kubernetes Job specs
  • PyTorchJob Support: Sets active_deadline_seconds on PyTorchJob RunPolicy
  • Backward Compatible: Existing code works unchanged
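
As a rough illustration of the mapping listed above, the sketch below works on plain manifest dictionaries rather than on the Katib SDK's own objects. The helper name apply_trial_timeout and the dict layout are assumptions made for this example and are not taken from the PR's code. The manifest field is written as activeDeadlineSeconds, which is the serialized form of the Python client attribute active_deadline_seconds.

```python
from copy import deepcopy
from typing import Any, Dict, Optional


def apply_trial_timeout(
    trial_spec: Dict[str, Any], trial_timeout: Optional[int] = None
) -> Dict[str, Any]:
    """Return a copy of trial_spec with the per-trial deadline applied.

    Hypothetical helper for illustration only; not the PR's implementation.
    """
    spec = deepcopy(trial_spec)
    if trial_timeout is None:
        # No timeout requested: leave the manifest untouched (backward compatible).
        return spec
    if spec.get("kind") == "PyTorchJob":
        # PyTorchJob carries the deadline under spec.runPolicy.
        run_policy = spec.setdefault("spec", {}).setdefault("runPolicy", {})
        run_policy["activeDeadlineSeconds"] = trial_timeout
    else:
        # A plain Kubernetes Job carries the deadline directly under spec.
        spec.setdefault("spec", {})["activeDeadlineSeconds"] = trial_timeout
    return spec


job = apply_trial_timeout({"kind": "Job", "spec": {}}, trial_timeout=3600)
pytorch_job = apply_trial_timeout({"kind": "PyTorchJob", "spec": {}}, trial_timeout=7200)
unchanged = apply_trial_timeout({"kind": "Job", "spec": {}})
```

With no timeout given, the returned manifest is identical to the input, which is what keeps existing callers unaffected.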

Usage

# Job trial with 1-hour timeout
katib_client.tune(
    name="experiment",
    objective=my_function,
    parameters={"lr": katib.search.double(0.001, 0.1)},
    trial_timeout=3600  # 1 hour max per trial
)

# PyTorchJob trial with 2-hour timeout
katib_client.tune(
    name="distributed-experiment",
    objective=my_function,
    parameters={"batch_size": katib.search.int(32, 256)},
    resources_per_trial=TrainerResources(num_workers=4),
    trial_timeout=7200  # 2 hours max per trial
)

Benefits

  • Resource Protection: Prevents runaway trials from consuming all resources
  • Cost Control: Limits resource usage per trial
  • Production Safety: Safe to run in production environments
  • Zero Breaking Changes: Fully backward compatible

- Add optional trial_timeout parameter to tune() function
- Support timeout for both Job and PyTorchJob trial types
- Set active_deadline_seconds on Job spec for Job-based trials
- Set active_deadline_seconds on RunPolicy for PyTorchJob-based trials
- Add comprehensive documentation and usage examples
- Add test cases for both Job and PyTorchJob timeout scenarios
- Maintain backward compatibility with existing code

This feature prevents individual trials from running indefinitely
and consuming cluster resources by allowing users to specify
per-trial timeouts in seconds.

Signed-off-by: wassimbensalem <[email protected]>
@wassimbensalem wassimbensalem force-pushed the feat/trial-timeout-parameter branch from 3b08fb7 to d13028a Compare September 4, 2025 19:05
@wassimbensalem
Author

/assign

Member

@andreyvelich andreyvelich left a comment

Thanks for this great contribution @wassimbensalem!

packages_to_install: List[str] = None,
pip_index_url: str = "https://pypi.org/simple",
metrics_collector_config: Dict[str, Any] = {"kind": "StdOut"},
trial_timeout: Optional[int] = None,
Member

I like the trial_timeout parameter, but I was wondering whether we should align with the k8s API here.
For example, we can call it: trial_active_deadline_seconds
cc @kubeflow/kubeflow-sdk-team

Author

Sure! I was just thinking that trial_timeout is easier for users to understand. But we can definitely align with the k8s API naming.

Member

I like trial_timeout too, any thoughts @kubeflow/kubeflow-sdk-team ?

Usually "timeout" is for duration while "deadline" is for a fixed point in time.

An example is from the Golang context API: https://pkg.go.dev/context#WithTimeout

It might be Kubernetes APIs have a "timeout after start" semantic which might explain why deadline is used.

So if we want to provide the ability to pass a duration, it seems trial_timeout would be more appropriate.

Member

It might be Kubernetes APIs have a "timeout after start" semantic which might explain why deadline is used.

But don't we indicate "timeout after Trial start" with that property ?

Yes, assuming that interpretation of the Kubernetes naming choice is correct, tune(trial_active_deadline_seconds=...) may be more correct if we consider the duration applies after the trial actually starts.
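
To make the timeout-versus-deadline distinction in this thread concrete, here is a small sketch, assuming the semantics discussed above: the value passed is a duration in seconds, and the deadline is the point in time obtained by adding that duration to the trial's start time. The function names deadline_for and is_expired are made up for this example and are not part of Katib or Kubernetes.

```python
from datetime import datetime, timedelta, timezone


def deadline_for(start_time: datetime, active_deadline_seconds: int) -> datetime:
    """Turn a duration (seconds after start) into a fixed point in time."""
    return start_time + timedelta(seconds=active_deadline_seconds)


def is_expired(start_time: datetime, active_deadline_seconds: int, now: datetime) -> bool:
    """True once `now` has reached the deadline derived from the start time."""
    return now >= deadline_for(start_time, active_deadline_seconds)


# The clock starts when the trial starts, not when tune() is called.
start = datetime(2025, 9, 4, 19, 0, tzinfo=timezone.utc)
deadline = deadline_for(start, 3600)
still_running = is_expired(start, 3600, start + timedelta(minutes=59))
timed_out = is_expired(start, 3600, start + timedelta(minutes=60))
```

Under this reading the user supplies a duration, while the system enforces a deadline that only becomes known once the trial has a start time.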

Member

@wassimbensalem Does this name sound good ?

Member

trial_active_deadline_seconds looks good to me

Author

@andreyvelich sure, I will update the changes asap!


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign gaocegege for approval. For more information see the Kubernetes Code Review Process.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

- Fix line length and formatting in katib_client_test.py
- Fix line length and formatting in utils.py
- All changes made by black formatter

Signed-off-by: wassimbensalem <[email protected]>
- Updated parameter name in function signatures and documentation
- Updated test cases and assertions to use new parameter name
- Fixed line length issues to meet linting requirements
- Maintains backward compatibility with Kubernetes activeDeadlineSeconds field

Signed-off-by: wassimbensalem <[email protected]>
@wassimbensalem wassimbensalem force-pushed the feat/trial-timeout-parameter branch from 5267a91 to 69a34bc Compare September 29, 2025 09:36