Katib metrics collector fails at end of trial #2124

@AlexandreBrown

/kind bug

What steps did you take and what happened:
At the end of a trial the training succeeds, but the metrics-logger-and-collector container fails with the following error:

I0223 23:36:09.892466      15 main.go:139] 2023-02-23T23:36:09.891114+00:00 step 405:
I0223 23:36:09.892497      15 main.go:139] 2023-02-23T23:36:09.891114+00:00 COCOMetric=0.0
I0223 23:36:09.892519      15 main.go:139] 
I0223 23:36:17.028058      15 main.go:139] 2023-02-23T23:36:17.027645+00:00 step 424:
I0223 23:36:17.028100      15 main.go:139] 
I0223 23:36:23.943666      15 main.go:139] 2023-02-23T23:36:23.943312+00:00 step 449:
I0223 23:36:23.943701      15 main.go:139] 
I0223 23:36:30.862968      15 main.go:139] 2023-02-23T23:36:30.862618+00:00 step 474:
I0223 23:36:30.863010      15 main.go:139] 
I0223 23:36:37.776336      15 main.go:139] 2023-02-23T23:36:37.775965+00:00 step 499:
I0223 23:36:37.776396      15 main.go:139] 
I0223 23:36:44.553677      15 main.go:139] 2023-02-23T23:36:44.553346+00:00 step 524:
I0223 23:36:44.553717      15 main.go:139] 
I0223 23:36:51.373176      15 main.go:139] 2023-02-23T23:36:51.372855+00:00 step 549:
I0223 23:36:51.373224      15 main.go:139] 
I0223 23:36:58.237768      15 main.go:139] 2023-02-23T23:36:58.237402+00:00 step 574:
I0223 23:36:58.237815      15 main.go:139] 
I0223 23:37:30.847170      15 main.go:139] 2023-02-23T23:37:30.846809+00:00 step 599:
I0223 23:37:30.847196      15 main.go:139] 
F0223 23:37:33.549047      15 main.go:445] Failed to collect logs: format must be set TEXT or JSON

What did you expect to happen:
I expected the metrics collector not to fail.

Anything else you would like to add:
We create the Katib experiment as part of a Kubeflow Pipelines component.
The pipeline had been working fine for six months, and today the Katib part no longer works.
The Katib experiment is created using the Katib SDK.
Here is a snippet of how we do that:

# Imports assume the kubeflow-katib SDK (models are re-exported at package level).
from kubeflow.katib import (
    KatibClient,
    V1beta1CollectorSpec,
    V1beta1Experiment,
    V1beta1ExperimentSpec,
    V1beta1FileSystemPath,
    V1beta1FilterSpec,
    V1beta1MetricsCollectorSpec,
    V1beta1SourceSpec,
)
from kubernetes.client import V1ObjectMeta

metrics_collector_spec = V1beta1MetricsCollectorSpec(
    collector=V1beta1CollectorSpec(
        kind="File"
    ),
    source=V1beta1SourceSpec(
        file_system_path=V1beta1FileSystemPath(
            kind="File",
            path=DEFAULT_METRICS_PATH
        ),
        filter=V1beta1FilterSpec(
            # Raw string so \w, \s, \d are not treated as string escapes
            metrics_format=[r"([\w|-]+)\s*=\s*([+-]?\d(\.\d+)?([Ee][+-]?\d+)?)"]
        )
    )
)

experiment_spec = V1beta1ExperimentSpec(
    max_trial_count=max_trial_count,
    max_failed_trial_count=max_failed_trial_count,
    parallel_trial_count=parallel_trial_count,
    objective=objective,
    algorithm=algorithm,
    early_stopping=early_stopping,
    parameters=parameters,
    trial_template=trial_template,
    metrics_collector_spec=metrics_collector_spec
)

experiment = V1beta1Experiment(
    api_version="kubeflow.org/v1beta1",
    kind="Experiment",
    metadata=V1ObjectMeta(
        name=experiment_name,
        namespace=namespace
    ),
    spec=experiment_spec
)

katib_client = KatibClient()
katib_client.create_experiment(experiment, namespace=namespace)

logger.info("Katib hyperparameter tuning job created!")

Here is the metrics collector portion of the experiment YAML that gets produced (checked with kubectl while the experiment was running):

spec:
    algorithm:
      algorithmName: random
    earlyStopping:
      algorithmName: medianstop
      algorithmSettings:
      - name: min_trials_required
        value: "1"
    maxFailedTrialCount: 1
    maxTrialCount: 5
    metricsCollectorSpec:
      collector:
        kind: File
      source:
        fileSystemPath:
          kind: File
          path: /var/log/katib/metrics.log
        filter:
          metricsFormat:
          - ([\w|-]+)\s*=\s*([+-]?\d(\.\d+)?([Ee][+-]?\d+)?)
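For what it's worth, this metricsFormat regex still matches the metric lines we write, so the filter itself does not look like the culprit. A quick check (my assumption here is that the file metrics collector applies the regex line by line):

```python
import re

# metricsFormat regex from the experiment spec above.
METRICS_FORMAT = r"([\w|-]+)\s*=\s*([+-]?\d(\.\d+)?([Ee][+-]?\d+)?)"

# A metric line exactly as it appears in the collector log output.
line = "2023-02-23T23:36:09.891114+00:00 COCOMetric=0.0"
match = re.search(METRICS_FORMAT, line)
print(match.group(1), match.group(2))  # COCOMetric 0.0
```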

During training we use a custom PyTorch Lightning logger that does the following (this is the relevant part, which essentially accumulates the metric lines in the expected format before they are written to the file):

# Method of our custom PyTorch Lightning logger; at module level we have
# `import datetime` and `from typing import Dict, Optional`.
def log_metrics(self, metrics_dict: Dict[str, float], step: Optional[int] = None) -> None:
    if step is None:
        return

    print(f"Logging metrics for step {step}...")

    step_metrics = ""

    timestamp = datetime.datetime.now(datetime.timezone.utc).isoformat()

    step_metrics += f"{timestamp} step {step}:\n"
    for k, v in metrics_dict.items():
        if k in self.metric_keys_allowed:
            step_metrics += f"{timestamp} {k}={v}\n"
    step_metrics += "\n"
    self.metrics += step_metrics

We use the format `timestamp key=value`, and it was working fine before.
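One detail worth noting: for a step where none of the logged keys are in metric_keys_allowed, the method above still emits the timestamped step header followed by a blank line, which matches the empty `step N:` entries visible in the collector log. A minimal self-contained sketch (the class name FileMetricsLogger is assumed for illustration):

```python
import datetime
from typing import Dict, Optional


class FileMetricsLogger:
    """Minimal stand-in for our custom logger (class name assumed)."""

    def __init__(self, metric_keys_allowed):
        self.metric_keys_allowed = metric_keys_allowed
        self.metrics = ""  # accumulated file contents

    def log_metrics(self, metrics_dict: Dict[str, float], step: Optional[int] = None) -> None:
        if step is None:
            return
        timestamp = datetime.datetime.now(datetime.timezone.utc).isoformat()
        step_metrics = f"{timestamp} step {step}:\n"
        for k, v in metrics_dict.items():
            if k in self.metric_keys_allowed:
                step_metrics += f"{timestamp} {k}={v}\n"
        step_metrics += "\n"
        self.metrics += step_metrics


logger = FileMetricsLogger(metric_keys_allowed={"COCOMetric"})
logger.log_metrics({"COCOMetric": 0.0}, step=405)  # header + one metric line
logger.log_metrics({"val_loss": 1.3}, step=424)    # header only: key filtered out
lines = logger.metrics.splitlines()
```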

I suspect the issue comes from the fact that the metrics collector image used by Kubeflow seems to be docker.io/kubeflowkatib/file-metrics-collector:latest, so maybe an incompatibility was introduced.
Is there a way to pin the version if that's the case?
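From what I can tell, the sidecar image appears to be configurable through the katib-config ConfigMap in the kubeflow namespace, so pinning might look like the fragment below. This is a sketch based on the v0.12 config layout; I have not verified it, and the exact keys may differ between Katib versions:

```yaml
# Hypothetical excerpt of the katib-config ConfigMap, pinning the File
# metrics collector sidecar to a fixed tag instead of :latest.
apiVersion: v1
kind: ConfigMap
metadata:
  name: katib-config
  namespace: kubeflow
data:
  metrics-collector-sidecar: |-
    {
      "File": {
        "image": "docker.io/kubeflowkatib/file-metrics-collector:v0.12.0"
      }
    }
```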

Environment:

  • Kubeflow 1.4.1 (AWS)
  • Katib version (check the Katib controller image version): 0.12.0
  • kubeflow-katib sdk version: Tried: 0.12.0, 0.13.0 and 0.14.0 without success
  • Kubernetes version: (kubectl version): 1.21
  • OS (uname -a): nvidia/cuda:11.1.1-base-ubuntu20.04 docker image
  • Katib controller:
    Container ID:  docker://551b5318b8498591d0461ccad58df307d3e5f4cabc8eb168363a9a57da4335c1
    Image:         docker.io/kubeflowkatib/katib-controller:v0.12.0
    Image ID:      docker-pullable://kubeflowkatib/katib-controller@sha256:12a28c8a0b41005537883e66826f4a89d33961348488c9e0bc6074eb0cf3cda4

Impacted by this bug? Give it a 👍 We prioritize the issues with the most 👍
