Description
/kind bug
What steps did you take and what happened:
At the end of a trial, the training succeeds, but the metrics-logger-and-collector container
fails with the following error:
I0223 23:36:09.892466 15 main.go:139] 2023-02-23T23:36:09.891114+00:00 step 405:
I0223 23:36:09.892497 15 main.go:139] 2023-02-23T23:36:09.891114+00:00 COCOMetric=0.0
I0223 23:36:09.892519 15 main.go:139]
I0223 23:36:17.028058 15 main.go:139] 2023-02-23T23:36:17.027645+00:00 step 424:
I0223 23:36:17.028100 15 main.go:139]
I0223 23:36:23.943666 15 main.go:139] 2023-02-23T23:36:23.943312+00:00 step 449:
I0223 23:36:23.943701 15 main.go:139]
I0223 23:36:30.862968 15 main.go:139] 2023-02-23T23:36:30.862618+00:00 step 474:
I0223 23:36:30.863010 15 main.go:139]
I0223 23:36:37.776336 15 main.go:139] 2023-02-23T23:36:37.775965+00:00 step 499:
I0223 23:36:37.776396 15 main.go:139]
I0223 23:36:44.553677 15 main.go:139] 2023-02-23T23:36:44.553346+00:00 step 524:
I0223 23:36:44.553717 15 main.go:139]
I0223 23:36:51.373176 15 main.go:139] 2023-02-23T23:36:51.372855+00:00 step 549:
I0223 23:36:51.373224 15 main.go:139]
I0223 23:36:58.237768 15 main.go:139] 2023-02-23T23:36:58.237402+00:00 step 574:
I0223 23:36:58.237815 15 main.go:139]
I0223 23:37:30.847170 15 main.go:139] 2023-02-23T23:37:30.846809+00:00 step 599:
I0223 23:37:30.847196 15 main.go:139]
F0223 23:37:33.549047 15 main.go:445] Failed to collect logs: format must be set TEXT or JSON
What did you expect to happen:
I expected the metric collector to not fail.
Anything else you would like to add:
We create a Katib experiment as part of a Kubeflow pipeline component.
The pipeline had been working fine for 6 months, and today the Katib part no longer works.
The Katib experiment is created using the Katib SDK.
Here is a snippet of how we do that:
# Imports for the snippet (kubeflow-katib SDK models/client and the Kubernetes client metadata type)
from kubeflow.katib import (
    KatibClient,
    V1beta1CollectorSpec,
    V1beta1Experiment,
    V1beta1ExperimentSpec,
    V1beta1FileSystemPath,
    V1beta1FilterSpec,
    V1beta1MetricsCollectorSpec,
    V1beta1SourceSpec,
)
from kubernetes.client import V1ObjectMeta

metrics_collector_spec = V1beta1MetricsCollectorSpec(
    collector=V1beta1CollectorSpec(
        kind="File"
    ),
    source=V1beta1SourceSpec(
        file_system_path=V1beta1FileSystemPath(
            kind="File",
            path=DEFAULT_METRICS_PATH
        ),
        filter=V1beta1FilterSpec(
            metrics_format=["([\w|-]+)\s*=\s*([+-]?\d(\.\d+)?([Ee][+-]?\d+)?)"]
        )
    )
)

experiment_spec = V1beta1ExperimentSpec(
    max_trial_count=max_trial_count,
    max_failed_trial_count=max_failed_trial_count,
    parallel_trial_count=parallel_trial_count,
    objective=objective,
    algorithm=algorithm,
    early_stopping=early_stopping,
    parameters=parameters,
    trial_template=trial_template,
    metrics_collector_spec=metrics_collector_spec
)

experiment = V1beta1Experiment(
    api_version="kubeflow.org/v1beta1",
    kind="Experiment",
    metadata=V1ObjectMeta(
        name=experiment_name,
        namespace=namespace
    ),
    spec=experiment_spec
)

katib_client = KatibClient()
katib_client.create_experiment(experiment, namespace=namespace)
logger.info("Katib hyperparameters tuning job created!")
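Possibly related to the "format must be set TEXT or JSON" error above: assuming newer SDK/Katib versions expose a format field on V1beta1FileSystemPath (I have not verified this for the versions we tried), a variant we could try would look like this:

# Hypothetical sketch only, not what we currently run: assumes a format argument
# ("TEXT" or "JSON") exists on V1beta1FileSystemPath in newer SDK versions.
file_system_path=V1beta1FileSystemPath(
    kind="File",
    path=DEFAULT_METRICS_PATH,
    format="TEXT"
)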
Here is the metrics collector portion of the experiment YAML that gets produced (checked with kubectl
while the experiment was running):
spec:
  algorithm:
    algorithmName: random
  earlyStopping:
    algorithmName: medianstop
    algorithmSettings:
    - name: min_trials_required
      value: "1"
  maxFailedTrialCount: 1
  maxTrialCount: 5
  metricsCollectorSpec:
    collector:
      kind: File
    source:
      fileSystemPath:
        kind: File
        path: /var/log/katib/metrics.log
      filter:
        metricsFormat:
        - ([\w|-]+)\s*=\s*([+-]?\d(\.\d+)?([Ee][+-]?\d+)?)
During training we have a custom PyTorch Lightning logger that does the following (this is the relevant part, which builds the lines that get written to the metrics file in that format):
def log_metrics(self, metrics_dict: Dict[str, float], step: Optional[int] = None) -> None:
    if step is None:
        return
    print(f"Logging metrics for step {step}...")
    step_metrics = ""
    timestamp = datetime.datetime.now(datetime.timezone.utc).isoformat()
    step_metrics += f"{timestamp} step {step}:\n"
    for k, v in metrics_dict.items():
        if k in self.metric_keys_allowed:
            step_metrics += f"{timestamp} {k}={v}\n"
    step_metrics += "\n"
    self.metrics += step_metrics
We use the format timestamp key=value, and it was working fine before.
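For example, for step 405 the metrics file ends up containing a block like the one echoed in the collector logs above:

2023-02-23T23:36:09.891114+00:00 step 405:
2023-02-23T23:36:09.891114+00:00 COCOMetric=0.0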
I suspect the issue comes from the fact that the metrics logger image used by Kubeflow seems to be docker.io/kubeflowkatib/file-metrics-collector:latest, so maybe an incompatibility was introduced.
Is there a way to pin the version if that's the case?
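For reference, I assume pinning would go through the metrics-collector-sidecar entry of the katib-config ConfigMap, roughly like the sketch below, but I have not verified the exact keys for our Katib version:

# Sketch only: assumed katib-config layout for pinning the File collector sidecar image
apiVersion: v1
kind: ConfigMap
metadata:
  name: katib-config
  namespace: kubeflow
data:
  metrics-collector-sidecar: |-
    {
      "File": {
        "image": "docker.io/kubeflowkatib/file-metrics-collector:v0.12.0"
      }
    }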
Environment:
- Kubeflow 1.4.1 (AWS)
- Katib version (check the Katib controller image version): 0.12.0
- kubeflow-katib SDK version: tried 0.12.0, 0.13.0 and 0.14.0, without success
- Kubernetes version (kubectl version): 1.21
- OS (uname -a): nvidia/cuda:11.1.1-base-ubuntu20.04 docker image
Katib controller:
katib-controller:
Container ID: docker://551b5318b8498591d0461ccad58df307d3e5f4cabc8eb168363a9a57da4335c1
Image: docker.io/kubeflowkatib/katib-controller:v0.12.0
Image ID: docker-pullable://kubeflowkatib/katib-controller@sha256:12a28c8a0b41005537883e66826f4a89d33961348488c9e0bc6074eb0cf3cda4
Impacted by this bug? Give it a 👍 We prioritize the issues with the most 👍