[GSoC] KEP for Project 6: Push-based Metrics Collection for Katib #2328
Merged: google-oss-prow merged 13 commits into kubeflow:master from Electronic-Waste:KEP-metrics-collector on Jun 28, 2024.

Commits (13, all by Electronic-Waste):

- ef09f05 doc: initial commit of gsoc proposal (project 6).
- aff97bf doc: complete KEP for gsoc proposal (Project 6).
- 7ece3ed chore: add non-goals and examples.
- e923a33 chore: add .
- 169857c chore: add compatibility changes in trial controller.
- 0cfac1d chore: update architecture figure.
- 6969dd7 chore: update format.
- 1a2599d chore: update doc after the review in 10th, June.
- d2439f2 chore: add code link and remove namespace env variable.
- 3208e67 chore: modify proposal after the review in 14th, June.
- cc95ef0 chore: delete WIP label.
- 4161cff chore: add timeout param into report_metrics.
- c32a84a fix: metrics_collector_config spelling.

# Push-based Metrics Collection Proposal

## Links

- [katib/issues#577 ([Enhancement Request] Metrics Collector Push-based Implementation)](https://github.com/kubeflow/katib/issues/577)

## Motivation

[Katib](https://github.com/kubeflow/katib) is a Kubernetes-native project for automated machine learning (AutoML). It can tune hyperparameters of applications written in any language, natively supports many ML frameworks, and also provides features such as early stopping and neural architecture search.

During hyperparameter tuning, the Metrics Collector, which is implemented as a sidecar container attached to each training container in the [current design](https://github.com/kubeflow/katib/blob/master/docs/proposals/metrics-collector.md), collects training logs from Trials once the training is complete. It then parses the logs to extract metrics such as accuracy or loss and passes the evaluation results to the hyperparameter tuning algorithm.

However, the current implementation of the Metrics Collector is pull-based, which raises [design problems](https://github.com/kubeflow/training-operator/issues/722#issuecomment-405669269) such as choosing the frequency at which to scrape metrics, performance issues such as the overhead of running many sidecar containers, and restrictions on deployment environments, which must support sidecar containers. Thus, we should add a new API to the Katib Python SDK that offers users a push-based way to store metrics directly in the Katib DB and resolves the issues raised by pull-based metrics collection.



Fig.1 Architecture of the new design

### Goals

1. **A new parameter in the Python SDK function `tune`**: allow users to specify the method of collecting metrics (push-based or pull-based).

2. **A new interface `report_metrics` in the Python SDK**: push metrics directly to the Katib DB.

3. The final metrics of worker pods should be **pushed to the Katib DB directly** in the push mode of metrics collection.

### Non-Goals

1. Implement an authentication model for pushing metrics to the Katib DB.

2. Support pushing data to other types of storage systems (Prometheus, user-defined interfaces, etc.).

## API

### New Parameter in Python SDK Function `tune`

We decided to add a `metrics_collector_config` parameter to the `tune` function in the Python SDK.

```Python
def tune(
    self,
    name: str,
    objective: Callable,
    parameters: Dict[str, Any],
    base_image: str = constants.BASE_IMAGE_TENSORFLOW,
    namespace: Optional[str] = None,
    env_per_trial: Optional[Union[Dict[str, str], List[Union[client.V1EnvVar, client.V1EnvFromSource]]]] = None,
    algorithm_name: str = "random",
    algorithm_settings: Union[dict, List[models.V1beta1AlgorithmSetting], None] = None,
    objective_metric_name: str = None,
    additional_metric_names: List[str] = [],
    objective_type: str = "maximize",
    objective_goal: float = None,
    max_trial_count: int = None,
    parallel_trial_count: int = None,
    max_failed_trial_count: int = None,
    resources_per_trial: Union[dict, client.V1ResourceRequirements, None] = None,
    retain_trials: bool = False,
    packages_to_install: List[str] = None,
    pip_index_url: str = "https://pypi.org/simple",
    # The newly added parameter metrics_collector_config.
    # It specifies the config of the metrics collector, for example,
    # metrics_collector_config={"kind": "Push"},
    metrics_collector_config: Dict[str, Any] = {"kind": "StdOut"},
)
```

### New Interface `report_metrics` in Python SDK

```Python
"""Push metrics directly to Katib DB.

[!!!] The Trial name should always be passed into Katib Trials as the env variable `KATIB_TRIAL_NAME`.

Args:
    metrics: Dict of metrics pushed to Katib DB.
        For example, `metrics = {"loss": 0.01, "accuracy": 0.99}`.
    db_manager_address: Address of the Katib DB Manager in this format: `ip-address:port`.

Raises:
    RuntimeError: Unable to push Trial metrics to Katib DB.
"""
def report_metrics(
    metrics: Dict[str, Any],
    db_manager_address: str = constants.DEFAULT_DB_MANAGER_ADDRESS,
)
```

### A Simple Example

```Python
import kubeflow.katib as katib

# Step 1. Create an objective function with push-based metrics collection.
def objective(parameters):
    # Import required packages.
    import kubeflow.katib as katib
    # Calculate the objective function.
    result = 4 * int(parameters["a"]) - float(parameters["b"]) ** 2
    # Push metrics to Katib DB.
    katib.report_metrics({"result": result})

# Step 2. Create the hyperparameter search space.
parameters = {
    "a": katib.search.int(min=10, max=20),
    "b": katib.search.double(min=0.1, max=0.2)
}

# Step 3. Create a Katib Experiment with 12 Trials and 2 GPUs per Trial.
katib_client = katib.KatibClient(namespace="kubeflow")
name = "tune-experiment"
katib_client.tune(
    name=name,
    objective=objective,
    parameters=parameters,
    objective_metric_name="result",
    max_trial_count=12,
    resources_per_trial={"gpu": "2"},
    metrics_collector_config={"kind": "Push"},
)

# Step 4. Get the best hyperparameters.
print(katib_client.get_optimal_hyperparameters(name))
```

## Implementation

### Add New Parameter in `tune`

As mentioned above, we decided to add `metrics_collector_config` to the `tune` function in the Python SDK. The following changes are also needed:

1. Configure the way of metrics collection: set `spec.metricsCollectorSpec.collector.kind` (which specifies how metrics are collected) to `Push` (see the sketch after this list).

2. Rename the metrics collector kind from `None` to `Push`: it's not accurate to call push-based metrics collection `None`, so the related code should be updated to rename it.

3. Write env variables into the Trial spec: set `KATIB_TRIAL_NAME` so that the `report_metrics` function can identify the Trial when dialing the DB Manager.
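
A minimal sketch of this SDK-side change, assuming the generated `V1beta1MetricsCollectorSpec` and `V1beta1CollectorSpec` models from the current SDK; the helper name and the way `tune` wires it into the Experiment are illustrative only:

```Python
from kubeflow.katib import models

# Hypothetical helper used by `tune` to map the new metrics_collector_config
# parameter onto the Experiment's spec.metricsCollectorSpec field.
def build_metrics_collector_spec(metrics_collector_config):
    kind = metrics_collector_config.get("kind", "StdOut")
    return models.V1beta1MetricsCollectorSpec(
        collector=models.V1beta1CollectorSpec(kind=kind),
    )

# With metrics_collector_config={"kind": "Push"}, the resulting Experiment carries
# spec.metricsCollectorSpec.collector.kind: Push, so the webhook is expected not to
# inject the pull-based sidecar (as with the existing `None` kind).
```
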
### New Interface `report_metrics` in Python SDK

We decided to implement this function to push metrics directly to the Katib DB over gRPC. The Trial name should always be passed into Katib Trials (and then into this function) as the env variable `KATIB_TRIAL_NAME`.

Also, the function is supposed to be implemented as a **global function**, because it is called in the user container.

Steps:

1. Wrap metrics into `katib_api_pb2.ReportObservationLogRequest`:

   First, convert the metrics (in dict format) into the `katib_api_pb2.ReportObservationLogRequest` type for the following gRPC call, referring to [ReportObservationLogRequest](https://github.com/kubeflow/katib/blob/master/pkg/apis/manager/v1beta1/gen-doc/api.md#reportobservationlogrequest).

2. Dial the Katib DB Manager service:

   We'll create a DBManager stub and make a gRPC call to report metrics to the Katib DB; a hedged sketch covering both steps follows.
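
The sketch below is a minimal illustration of both steps, not the final implementation: the import paths of the generated stubs (`katib_api_pb2`, `katib_api_pb2_grpc`), the timestamp format, and the default DB Manager address are assumptions.

```Python
import os
from datetime import datetime, timezone
from typing import Any, Dict

import grpc

# Assumed import paths for the gRPC stubs generated from the Katib v1beta1 API.
from kubeflow.katib import katib_api_pb2, katib_api_pb2_grpc

# Placeholder for constants.DEFAULT_DB_MANAGER_ADDRESS.
DEFAULT_DB_MANAGER_ADDRESS = "katib-db-manager.kubeflow:6789"


def report_metrics(
    metrics: Dict[str, Any],
    db_manager_address: str = DEFAULT_DB_MANAGER_ADDRESS,
):
    # The Trial name is injected into the training container by the controller.
    trial_name = os.environ["KATIB_TRIAL_NAME"]
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

    # Step 1. Wrap the metrics dict into a ReportObservationLogRequest.
    request = katib_api_pb2.ReportObservationLogRequest(
        trial_name=trial_name,
        observation_log=katib_api_pb2.ObservationLog(
            metric_logs=[
                katib_api_pb2.MetricLog(
                    time_stamp=timestamp,
                    metric=katib_api_pb2.Metric(name=name, value=str(value)),
                )
                for name, value in metrics.items()
            ]
        ),
    )

    # Step 2. Dial the Katib DB Manager and report the observation log.
    with grpc.insecure_channel(db_manager_address) as channel:
        stub = katib_api_pb2_grpc.DBManagerStub(channel)
        try:
            stub.ReportObservationLog(request)
        except grpc.RpcError as e:
            raise RuntimeError(f"Unable to push Trial metrics to Katib DB: {e}") from e
```
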
### Compatibility Changes in Trial Controller

We need to make appropriate changes in the Trial controller to make sure we insert an unavailable value into the Katib DB if the user accidentally fails to report metrics. The current implementation handles unavailable metrics here:

```Golang
// If the observation is empty, the metrics collector hasn't finished.
// For early stopping, the metrics collector reports logs before the Trial status is changed to EarlyStopped.
if jobStatus.Condition == trialutil.JobSucceeded && instance.Status.Observation == nil {
	logger.Info("Trial job is succeeded but metrics are not reported, reconcile requeued")
	return errMetricsNotReported
}
```

1. Distinguish pull-based and push-based metrics collection

   We decided to add an if-else statement to the code above to distinguish pull-based from push-based metrics collection. In push-based collection, the Trial does not need to be requeued. Instead, we'll insert an unavailable value into the Katib DB.

2. Update the status of the Trial to `MetricsUnavailable`

   In the current implementation of pull-based metrics collection, Trials are re-queued when the metrics collector finds that `.Status.Observation` is empty. However, this is not compatible with push-based metrics collection, because the forgotten metrics won't be reported in a new round of reconciliation. So, we need to update the Trial status in the function `UpdateTrialStatusCondition` to handle the push-based case alongside the existing pull-based handling. The following code will be inserted before [trial_controller_util.go#L69](https://github.com/kubeflow/katib/blob/7959ffd54851216dbffba791e1da13c8485d1085/pkg/controller.v1beta1/trial/trial_controller_util.go#L69):

```Golang
else if instance.Spec.MetricsCollector.Collector.Kind == "Push" {
	... // Update the status of this Trial to `MetricsUnavailable` and output the reason.
}
```

### Collection of Final Metrics

The final metrics of worker pods should be pushed to the Katib DB directly in the push mode of metrics collection.