Katib metrics collector fails at end of trial #2124

@AlexandreBrown

/kind bug

What steps did you take and what happened:
At the end of a trial the training succeeds, but the metrics-logger-and-collector container fails with the following error:

I0223 23:36:09.892466      15 main.go:139] 2023-02-23T23:36:09.891114+00:00 step 405:
I0223 23:36:09.892497      15 main.go:139] 2023-02-23T23:36:09.891114+00:00 COCOMetric=0.0
I0223 23:36:09.892519      15 main.go:139] 
I0223 23:36:17.028058      15 main.go:139] 2023-02-23T23:36:17.027645+00:00 step 424:
I0223 23:36:17.028100      15 main.go:139] 
I0223 23:36:23.943666      15 main.go:139] 2023-02-23T23:36:23.943312+00:00 step 449:
I0223 23:36:23.943701      15 main.go:139] 
I0223 23:36:30.862968      15 main.go:139] 2023-02-23T23:36:30.862618+00:00 step 474:
I0223 23:36:30.863010      15 main.go:139] 
I0223 23:36:37.776336      15 main.go:139] 2023-02-23T23:36:37.775965+00:00 step 499:
I0223 23:36:37.776396      15 main.go:139] 
I0223 23:36:44.553677      15 main.go:139] 2023-02-23T23:36:44.553346+00:00 step 524:
I0223 23:36:44.553717      15 main.go:139] 
I0223 23:36:51.373176      15 main.go:139] 2023-02-23T23:36:51.372855+00:00 step 549:
I0223 23:36:51.373224      15 main.go:139] 
I0223 23:36:58.237768      15 main.go:139] 2023-02-23T23:36:58.237402+00:00 step 574:
I0223 23:36:58.237815      15 main.go:139] 
I0223 23:37:30.847170      15 main.go:139] 2023-02-23T23:37:30.846809+00:00 step 599:
I0223 23:37:30.847196      15 main.go:139] 
F0223 23:37:33.549047      15 main.go:445] Failed to collect logs: format must be set TEXT or JSON

What did you expect to happen:
I expected the metrics collector not to fail.

Anything else you would like to add:
We create the Katib experiment as part of a Kubeflow Pipelines component.
The pipeline had been working fine for six months, and today the Katib part no longer works.
The Katib experiment is created using the Katib SDK.
Here is a snippet of how we do that:

# Imports assume the kubeflow-katib SDK (models are re-exported at package level).
from kubeflow.katib import (
    KatibClient,
    V1beta1CollectorSpec,
    V1beta1Experiment,
    V1beta1ExperimentSpec,
    V1beta1FileSystemPath,
    V1beta1FilterSpec,
    V1beta1MetricsCollectorSpec,
    V1beta1SourceSpec,
)
from kubernetes.client import V1ObjectMeta

metrics_collector_spec = V1beta1MetricsCollectorSpec(
    collector=V1beta1CollectorSpec(
        kind="File"
    ),
    source=V1beta1SourceSpec(
        file_system_path=V1beta1FileSystemPath(
            kind="File",
            path=DEFAULT_METRICS_PATH
        ),
        filter=V1beta1FilterSpec(
            # Raw string so \w, \s, \d are not treated as string escapes
            metrics_format=[r"([\w|-]+)\s*=\s*([+-]?\d(\.\d+)?([Ee][+-]?\d+)?)"]
        )
    )
)

experiment_spec = V1beta1ExperimentSpec(
    max_trial_count=max_trial_count,
    max_failed_trial_count=max_failed_trial_count,
    parallel_trial_count=parallel_trial_count,
    objective=objective,
    algorithm=algorithm,
    early_stopping=early_stopping,
    parameters=parameters,
    trial_template=trial_template,
    metrics_collector_spec=metrics_collector_spec
)

experiment = V1beta1Experiment(
    api_version="kubeflow.org/v1beta1",
    kind="Experiment",
    metadata=V1ObjectMeta(
        name=experiment_name,
        namespace=namespace
    ),
    spec=experiment_spec
)

katib_client = KatibClient()
katib_client.create_experiment(experiment, namespace=namespace)

logger.info("Katib hyperparameter tuning job created!")

Here is the metrics collector portion of the experiment YAML that gets produced (checked with kubectl while the experiment was running):

spec:
    algorithm:
      algorithmName: random
    earlyStopping:
      algorithmName: medianstop
      algorithmSettings:
      - name: min_trials_required
        value: "1"
    maxFailedTrialCount: 1
    maxTrialCount: 5
    metricsCollectorSpec:
      collector:
        kind: File
      source:
        fileSystemPath:
          kind: File
          path: /var/log/katib/metrics.log
        filter:
          metricsFormat:
          - ([\w|-]+)\s*=\s*([+-]?\d(\.\d+)?([Ee][+-]?\d+)?)
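For what it's worth, this metricsFormat regex still matches the metric lines we write, so the filter itself does not look like the culprit. A quick check (my assumption here is that the file metrics collector applies the regex line by line):

```python
import re

# metricsFormat regex from the experiment spec above.
METRICS_FORMAT = r"([\w|-]+)\s*=\s*([+-]?\d(\.\d+)?([Ee][+-]?\d+)?)"

# A metric line exactly as it appears in the collector log output.
line = "2023-02-23T23:36:09.891114+00:00 COCOMetric=0.0"
match = re.search(METRICS_FORMAT, line)
print(match.group(1), match.group(2))  # COCOMetric 0.0
```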

During training we use a custom PyTorch Lightning logger that does the following (this is the relevant part, which essentially accumulates the metric lines in the expected format before they are written to the file):

# Method of our custom PyTorch Lightning logger; at module level we have
# `import datetime` and `from typing import Dict, Optional`.
def log_metrics(self, metrics_dict: Dict[str, float], step: Optional[int] = None) -> None:
    if step is None:
        return

    print(f"Logging metrics for step {step}...")

    step_metrics = ""

    timestamp = datetime.datetime.now(datetime.timezone.utc).isoformat()

    step_metrics += f"{timestamp} step {step}:\n"
    for k, v in metrics_dict.items():
        if k in self.metric_keys_allowed:
            step_metrics += f"{timestamp} {k}={v}\n"
    step_metrics += "\n"
    self.metrics += step_metrics

We use the format `timestamp key=value`, and it was working fine before.
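One detail worth noting: for a step where none of the logged keys are in metric_keys_allowed, the method above still emits the timestamped step header followed by a blank line, which matches the empty `step N:` entries visible in the collector log. A minimal self-contained sketch (the class name FileMetricsLogger is assumed for illustration):

```python
import datetime
from typing import Dict, Optional


class FileMetricsLogger:
    """Minimal stand-in for our custom logger (class name assumed)."""

    def __init__(self, metric_keys_allowed):
        self.metric_keys_allowed = metric_keys_allowed
        self.metrics = ""  # accumulated file contents

    def log_metrics(self, metrics_dict: Dict[str, float], step: Optional[int] = None) -> None:
        if step is None:
            return
        timestamp = datetime.datetime.now(datetime.timezone.utc).isoformat()
        step_metrics = f"{timestamp} step {step}:\n"
        for k, v in metrics_dict.items():
            if k in self.metric_keys_allowed:
                step_metrics += f"{timestamp} {k}={v}\n"
        step_metrics += "\n"
        self.metrics += step_metrics


logger = FileMetricsLogger(metric_keys_allowed={"COCOMetric"})
logger.log_metrics({"COCOMetric": 0.0}, step=405)  # header + one metric line
logger.log_metrics({"val_loss": 1.3}, step=424)    # header only: key filtered out
lines = logger.metrics.splitlines()
```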

I suspect the issue comes from the fact that the metrics collector image used by Kubeflow seems to be docker.io/kubeflowkatib/file-metrics-collector:latest, so maybe an incompatibility was introduced.
Is there a way to pin the version if that's the case?
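From what I can tell, the sidecar image appears to be configurable through the katib-config ConfigMap in the kubeflow namespace, so pinning might look like the fragment below. This is a sketch based on the v0.12 config layout; I have not verified it, and the exact keys may differ between Katib versions:

```yaml
# Hypothetical excerpt of the katib-config ConfigMap, pinning the File
# metrics collector sidecar to a fixed tag instead of :latest.
apiVersion: v1
kind: ConfigMap
metadata:
  name: katib-config
  namespace: kubeflow
data:
  metrics-collector-sidecar: |-
    {
      "File": {
        "image": "docker.io/kubeflowkatib/file-metrics-collector:v0.12.0"
      }
    }
```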

Environment:

  • Kubeflow 1.4.1 (AWS)
  • Katib version (check the Katib controller image version): 0.12.0
  • kubeflow-katib sdk version: Tried: 0.12.0, 0.13.0 and 0.14.0 without success
  • Kubernetes version: (kubectl version): 1.21
  • OS (uname -a): nvidia/cuda:11.1.1-base-ubuntu20.04 docker image
  • Katib controller:
    Container ID:  docker://551b5318b8498591d0461ccad58df307d3e5f4cabc8eb168363a9a57da4335c1
    Image:         docker.io/kubeflowkatib/katib-controller:v0.12.0
    Image ID:      docker-pullable://kubeflowkatib/katib-controller@sha256:12a28c8a0b41005537883e66826f4a89d33961348488c9e0bc6074eb0cf3cda4

Impacted by this bug? Give it a 👍 We prioritize the issues with the most 👍
