Metrics not reporting to Katib server - experiment timing out #1905

@farisfirenze

Description

I am trying to create an experiment in a Kubeflow pipeline using Python, where I hyperparameter-tune a simple script. I want to use Katib to tune the hyperparameters from Python (not by applying a YAML file). The problem is that I can't report the metrics to the Katib server, and since no metrics are reported, the experiment times out. So I need some help from the community.

Here is what I have tried:

  1. I created a GKE cluster and installed Katib, the training-operator, and Kubeflow Pipelines on it.
  2. I tried to create an experiment using a TFJob, as given below:

trial_spec = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "spec": {
        "tfReplicaSpecs": {
            "PS": {
                "replicas": 1,
                "restartPolicy": "Never",
                "template": {
                    "spec": {
                        "containers": [
                            {
                                "name": "tensorflow",
                                "image": "<image_name>",
                                "command": [
                                    "python",
                                    "/opt/trainer/task.py",
                                    "--epoch=${trialParameters.epoch}",
                                    "--batch_size=${trialParameters.batchSize}"
                                ]
                            }
                        ]
                    }
                }
            },
            "Worker": {
                "replicas": 1,
                "restartPolicy": "Never",
                "template": {
                    "spec": {
                        "containers": [
                            {
                                "name": "tensorflow",
                                "image": "<image_name>",
                                "command": [
                                    "python",
                                    "/opt/trainer/task.py",
                                    "--epoch=${trialParameters.epoch}",
                                    "--batch_size=${trialParameters.batchSize}"
                                ]
                            }
                        ]
                    }
                }
            }
        }
    }
}

The JSON above is my trial spec. The entire pipeline code is given below:

import kfp
import kfp.dsl as dsl
from kfp import components

from kubeflow.katib import ApiClient
from kubeflow.katib import V1beta1ExperimentSpec
from kubeflow.katib import V1beta1AlgorithmSpec
from kubeflow.katib import V1beta1ObjectiveSpec
from kubeflow.katib import V1beta1ParameterSpec
from kubeflow.katib import V1beta1FeasibleSpace
from kubeflow.katib import V1beta1TrialTemplate
from kubeflow.katib import V1beta1TrialParameterSpec
from kubeflow.katib import V1beta1MetricsCollectorSpec
from kubeflow.katib import V1beta1CollectorSpec
from kubeflow.katib import V1beta1FileSystemPath
from kubeflow.katib import V1beta1SourceSpec
from kubeflow.katib import V1beta1FilterSpec
# experiment_name = "tf-test17"
# experiment_namespace = "kubeflow"

# Trial count specification.
max_trial_count = 2
max_failed_trial_count = 2
parallel_trial_count = 1

# Objective specification.
objective = V1beta1ObjectiveSpec(
    type="minimize",
    # goal=100,
    objective_metric_name="loss"
    # additional_metric_names=["accuracy"]
)


# Metrics collector specification.
metrics_collector_specs = V1beta1MetricsCollectorSpec(
    collector=V1beta1CollectorSpec(kind="File"),
    source=V1beta1SourceSpec(
        file_system_path=V1beta1FileSystemPath(
            # format="TEXT",
            path="/opt/trainer/katib/metrics.log",
            kind="File"
        ),
        filter=V1beta1FilterSpec(
            metrics_format=["{metricName: ([\\w|-]+), metricValue: ((-?\\d+)(\\.\\d+)?)}"]
        
        )
    )
)
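
# Note: this spec tells the Katib metrics collector sidecar to read the file at
# `path` and extract metrics using the regexes in `metrics_format`. It only takes
# effect when passed as metrics_collector_spec to V1beta1ExperimentSpec, which is
# commented out further below, so the default StdOut collector applies there.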

# Algorithm specification.
algorithm = V1beta1AlgorithmSpec(
    algorithm_name="random",
)

# Experiment search space.
# In this example we tune learning rate and batch size.
parameters = [
    V1beta1ParameterSpec(
        name="epoch",
        parameter_type="int",
        feasible_space=V1beta1FeasibleSpace(
            min="5",
            max="12"
        ),
    ),
    V1beta1ParameterSpec(
        name="batch_size",
        parameter_type="int",
        feasible_space=V1beta1FeasibleSpace(
            min="12",
            max="32"
        ),
    )
]

# Experiment Trial template.


# TODO (andreyvelich): Use community image for the mnist example.
trial_spec = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "spec": {
        "tfReplicaSpecs": {
            "PS": {
                "replicas": 1,
                "restartPolicy": "Never",
                "template": {
                    "spec": {
                        "containers": [
                            {
                                "name": "tensorflow",
                                "image": "<image_name>",
                                "command": [
                                    "python",
                                    "/opt/trainer/task.py",
                                    "--epoch=${trialParameters.epoch}",
                                    "--batch_size=${trialParameters.batchSize}"
                                ]
                            }
                        ]
                    }
                }
            },
            "Worker": {
                "replicas": 1,
                "restartPolicy": "Never",
                "template": {
                    "spec": {
                        "containers": [
                            {
                                "name": "tensorflow",
                                "image": "<image_name>",
                                "command": [
                                    "python",
                                    "/opt/trainer/task.py",
                                    "--epoch=${trialParameters.epoch}",
                                    "--batch_size=${trialParameters.batchSize}"
                                ]
                            }
                        ]
                    }
                }
            }
        }
    }
}

# Configure parameters for the Trial template.
trial_template = V1beta1TrialTemplate(
    primary_container_name="tensorflow",
    trial_parameters=[
        V1beta1TrialParameterSpec(
            name="epoch",
            description="epoch",
            reference="epoch"
        ),
        V1beta1TrialParameterSpec(
            name="batchSize",
            description="Batch size for the model",
            reference="batch_size"
        ),
    ],
    trial_spec=trial_spec
)
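
# Each V1beta1TrialParameterSpec maps a search-space parameter (reference) to the
# ${trialParameters.*} placeholder of the same name used in trial_spec above.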

# Create an Experiment from the above parameters.
experiment_spec = V1beta1ExperimentSpec(
    max_trial_count=max_trial_count,
    max_failed_trial_count=max_failed_trial_count,
    parallel_trial_count=parallel_trial_count,
    # metrics_collector_spec=metrics_collector_specs,
    objective=objective,
    algorithm=algorithm,
    parameters=parameters,
    trial_template=trial_template
)

# Create the KFP task for the Katib Experiment.
# Experiment Spec should be serialized to a valid Kubernetes object.
katib_experiment_launcher_op = components.load_component_from_file("component.yaml")


@dsl.pipeline(
    name="Launch Katib early stopping Experiment",
    description="An example to launch Katib Experiment with early stopping"
)
def pipeline_func(
    experiment_name: str = "tf-test-1",
    experiment_namespace: str = "kubeflow",
    experiment_timeout_minutes: int = 5
):

    # Katib launcher component.
    # Experiment Spec should be serialized to a valid Kubernetes object.
    op = katib_experiment_launcher_op(
        experiment_name=experiment_name,
        experiment_namespace=experiment_namespace,
        experiment_spec=ApiClient().sanitize_for_serialization(experiment_spec),
        experiment_timeout_minutes=experiment_timeout_minutes,
        delete_finished_experiment=False)

    
    # restrict the maximum memory and CPU usable by the launcher step
    op.set_memory_limit("8G")
    op.set_cpu_limit("1")
    
    # Output container to print the results.
    op_out = dsl.ContainerOp(
        name="best-hp",
        image="library/bash:4.4.23",
        command=["sh", "-c"],
        arguments=["echo Best HyperParameters: %s" % op.output],
    )
    
    op_out.set_memory_limit("4G")
    op_out.set_cpu_limit("1")

if __name__ == '__main__':
    
    # compiling the model and generating tar.gz file to upload to Kubeflow Pipeline UI
    import kfp.compiler as compiler

    compiler.Compiler().compile(
        pipeline_func, 'pipeline_tf_text.tar.gz'
    )
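
For reference, instead of uploading the tar.gz through the Kubeflow Pipelines UI, submitting it programmatically should also work along these lines (the host is a placeholder, not my actual endpoint):

import kfp

# connect to the KFP API (placeholder endpoint)
client = kfp.Client(host="<kfp_endpoint>")

# create a run directly from the compiled package
client.create_run_from_pipeline_package(
    "pipeline_tf_text.tar.gz",
    arguments={"experiment_name": "tf-test-1"},
)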

And this is my /opt/trainer/task.py code:

import os
import re
import shutil
import string
import tensorflow as tf
import random
from tensorflow.keras import layers
from tensorflow.keras import losses
import argparse
import logging
from google.cloud import storage

# create the directory for the metrics log, relative to the working directory
os.makedirs("katib", exist_ok=True)

logging.basicConfig(
            format="%(asctime)s %(levelname)-8s %(message)s",
            datefmt="%Y-%m-%dT%H:%M:%SZ",
            level=logging.DEBUG,
            filename="katib/metrics.log")


if __name__ == "__main__":
    
    try: 
        list1 = [1245.99, 7554.00, 725.66, 546.88, 423.99, 7866.00]
        loss = random.choice(list1)
        logging.info("{{metricName: loss, metricValue: {:.4f}}};{{metricName: accuracy, metricValue: {:.4f}}}\n".format(loss, loss))
        
    except Exception as e:
        print(e)
     

The task.py code had more training logic; I removed it since my only problem is reporting metrics to the Katib server. Since I am simply taking a random number from the list and reporting it, I would expect this to work, but it doesn't.
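
As a sanity check, here is a minimal standalone sketch (mine, not part of the pipeline) to confirm that the filter regex from metrics_format matches a line shaped the way logging writes it, including the timestamp/level prefix that basicConfig adds:

import re

# Same filter regex as in metrics_format above, as a Python raw string.
pattern = r"{metricName: ([\w|-]+), metricValue: ((-?\d+)(\.\d+)?)}"

# A line shaped like what logging.info() produces with the basicConfig format above.
line = ("2022-06-23T05:53:55Z INFO     "
        "{metricName: loss, metricValue: 725.6600};"
        "{metricName: accuracy, metricValue: 725.6600}")

# findall returns one tuple of groups per metric: (name, value, int part, fraction).
print(re.findall(pattern, line))
# [('loss', '725.6600', '725', '.6600'), ('accuracy', '725.6600', '725', '.6600')]

Both metrics match here, so the regex itself does not look like the problem to me.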

  3. I have also tried giving the path in file_system_path.path as the relative path katib/metrics.log, as given below:

# Metrics collector specification.
metrics_collector_specs = V1beta1MetricsCollectorSpec(
    collector=V1beta1CollectorSpec(kind="File"),
    source=V1beta1SourceSpec(
        file_system_path=V1beta1FileSystemPath(
            # format="TEXT",
            path="katib/metrics.log",
            kind="File"
        ),
        filter=V1beta1FilterSpec(
            metrics_format=["{metricName: ([\\w|-]+), metricValue: ((-?\\d+)(\\.\\d+)?)}"]

        )
    )
)

This gives me the following error:

time="2022-06-23T05:53:55.247Z" level=info msg="capturing logs" argo=true
INFO:root:Creating Experiment: tf-test-1 in namespace: kubeflow
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/kubeflow/katib/api/katib_client.py", line 74, in create_experiment
    exp_object)
  File "/usr/local/lib/python3.6/site-packages/kubernetes/client/apis/custom_objects_api.py", line 178, in create_namespaced_custom_object
    (data) = self.create_namespaced_custom_object_with_http_info(group, version, namespace, plural, body, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/kubernetes/client/apis/custom_objects_api.py", line 277, in create_namespaced_custom_object_with_http_info
    collection_formats=collection_formats)
  File "/usr/local/lib/python3.6/site-packages/kubernetes/client/api_client.py", line 334, in call_api
    _return_http_data_only, collection_formats, _preload_content, _request_timeout)
  File "/usr/local/lib/python3.6/site-packages/kubernetes/client/api_client.py", line 168, in __call_api
    _request_timeout=_request_timeout)
  File "/usr/local/lib/python3.6/site-packages/kubernetes/client/api_client.py", line 377, in request
    body=body)
  File "/usr/local/lib/python3.6/site-packages/kubernetes/client/rest.py", line 266, in POST
    body=body)
  File "/usr/local/lib/python3.6/site-packages/kubernetes/client/rest.py", line 222, in request
    raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'Audit-Id': '251506ca-f8e0-487a-9bea-9c87f435991b', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '4f14223c-b7ef-4a61-935e-760caef0517b', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'a0b375b1-ec21-45bd-a615-84415d25710b', 'Date': 'Thu, 23 Jun 2022 05:53:56 GMT', 'Content-Length': '279'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"admission webhook \"validator.experiment.katib.kubeflow.org\" denied the request: file path where metrics file exists is required by .spec.metricsCollectorSpec.source.fileSystemPath.path","code":400}
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "src/launch_experiment.py", line 115, in <module>
    output = katib_client.create_experiment(experiment, namespace=experiment_namespace)
  File "/usr/local/lib/python3.6/site-packages/kubeflow/katib/api/katib_client.py", line 78, in create_experiment
    %s\n" % e)
RuntimeError: Exception when calling CustomObjectsApi->create_namespaced_custom_object:         (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'Audit-Id': '251506ca-f8e0-487a-9bea-9c87f435991b', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '4f14223c-b7ef-4a61-935e-760caef0517b', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'a0b375b1-ec21-45bd-a615-84415d25710b', 'Date': 'Thu, 23 Jun 2022 05:53:56 GMT', 'Content-Length': '279'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"admission webhook \"validator.experiment.katib.kubeflow.org\" denied the request: file path where metrics file exists is required by .spec.metricsCollectorSpec.source.fileSystemPath.path","code":400}
time="2022-06-23T05:53:56.361Z" level=error msg="cannot save parameter /tmp/outputs/Best_Parameter_Set/data" argo=true error="open /tmp/outputs/Best_Parameter_Set/data: no such file or directory"
time="2022-06-23T05:53:56.361Z" level=error msg="cannot save artifact /tmp/outputs/Best_Parameter_Set/data" argo=true error="stat /tmp/outputs/Best_Parameter_Set/data: no such file or directory"
Error: exit status 1

So I changed the path to the full path /opt/trainer/katib/metrics.log. The experiment just times out when I do so.

  4. I have also tried changing the trial spec as follows:

trial_spec = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "tensorflow",
                        "image": "<image_name>",
                        "command": [
                            "python",
                            "/opt/trainer/task.py",
                            "--epoch=${trialParameters.epoch}",
                            "--batch_size=${trialParameters.batchSize}"
                        ]
                    }
                ]
            }
        }
    }
}

FYI: Katib can get into my container, and I can see the pod logs saying it succeeded, but the metrics are not reported. I would like some help from the community ASAP, please.
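
For reference, this is roughly how I inspect the trial pods and their logs (the pod name is a placeholder; the sidecar container name is my assumption based on current Katib manifests and may differ by version):

# list the trial pods created by the experiment
kubectl -n kubeflow get pods

# logs of the training container (should contain the metrics lines)
kubectl -n kubeflow logs <trial-pod-name> -c tensorflow

# logs of the injected metrics collector sidecar, if one was injected
kubectl -n kubeflow logs <trial-pod-name> -c metrics-logger-and-collector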

Please comment if you need any more information. I have tried many other things, but I can't post everything here.

This is how I created my cluster and did all the installation:

CLUSTER_NAME="kubeflow-pipelines-standalone-v2"
ZONE="us-central1-a"
MACHINE_TYPE="n1-standard-4"
SCOPES="cloud-platform"
NODES_NUM=1

gcloud container clusters create $CLUSTER_NAME --zone $ZONE --machine-type $MACHINE_TYPE --scopes $SCOPES --num-nodes $NODES_NUM

gcloud config set compute/zone $ZONE
gcloud container clusters get-credentials $CLUSTER_NAME

export PIPELINE_VERSION=1.8.1
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=$PIPELINE_VERSION"
kubectl wait --for condition=established --timeout=60s crd/applications.app.k8s.io
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/env/dev?ref=$PIPELINE_VERSION"
# katib
kubectl apply -k "github.com/kubeflow/katib.git/manifests/v1beta1/installs/katib-standalone?ref=master"
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.3.0"
kubectl apply -f ./test.yaml

test.yaml file:

apiVersion: v1
kind: Namespace
metadata:
  name: kubeflow
  labels:
    katib.kubeflow.org/metrics-collector-injection: enabled
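
To confirm the label actually landed on the namespace, a quick check (plain kubectl, nothing Katib-specific):

kubectl get namespace kubeflow --show-labels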

References:

  1. https://github.com/kubeflow/katib/blob/master/examples/v1beta1/sdk/nas-with-darts.ipynb
  2. https://www.kubeflow.org/docs/components/katib/experiment/#metrics-collector
  3. https://github.com/kubeflow/katib/blob/master/examples/v1beta1/metrics-collector/file-metrics-collector.yaml#L13-L22
  4. https://github.com/kubeflow/pipelines/blob/master/components/kubeflow/katib-launcher/component.yaml
