[GSoC] Add e2e test for tune api with LLM hyperparameter optimization (kubeflow#2420)

helenxie-bit · web-flow · commit 73b8c5c02962 · 2025-06-26T14:13:16.000Z
* add e2e test for tune api

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* upgrade training-operator sdk

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* specify the version of training operator sdk

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix num_labels error and update the version of training operator controller

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* check the version of training operator

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* debug

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* check import path of HuggingFaceModelParams

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update the version of training operator sdk

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update the name of experiment

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* add step of checking pod

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* check the logs of pod

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* add check

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* check reason for imagepullbackoff

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* revert timeout limit

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix format

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* extend timeout limit

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update training operator sdk version

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* check the logs of pod

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* rerun tests

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update the function of getting logs

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* add the step of describing pod

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* check disk space

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* change work directory

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* change work directory

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* increase timeout limit

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* check the logs of controller and events

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* change work directory

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* change work directory

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* change work directory

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* check the logs of kubelet

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* check the logs of kubelet

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* increase cpu

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* check the logs of training operator

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* check the use of resources

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* check the logs of container 'pytorch' and 'storage_initializer'

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix error of checking use of resources

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* add other checks to find the error reason

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* set 'storage_config'

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* reduce the number of tests

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* Check container runtime logs

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* set the driver of minikube as docker

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* set the driver of minikube to none

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* check logs of pod

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* check memory usage

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* increase 'termination_grace_period_seconds' in podspec

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix annotations error

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* restart docker

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* delete restarting docker

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* use original docker data directory

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update installation of Katib SDK with extra requires

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* test trainer image built with cpu

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* add action of free up disk space (including move docker data directory)

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* delete unnecessary checks and update the part of fetching pod description and logs

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* delete fetching pod logs

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* add blank line at the end of free-up-disk-space yaml file

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update experiment name

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update test function name to be consistent with experiment name

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* move import statements inside the function

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* apply pprint for the logging output

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update experiment names

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix format

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix format

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix the sequence of arguments in 'trial_template'

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* test example in user guide

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix access token error

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix the error of setup

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix the error of setup

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* reverse back

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix format

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix format

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

---------

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
diff --git a/.github/workflows/e2e-test-tune-api.yaml b/.github/workflows/e2e-test-tune-api.yaml
@@ -21,7 +21,12 @@ jobs:
         uses: ./.github/workflows/template-setup-e2e-test
         with:
           kubernetes-version: ${{ matrix.kubernetes-version }}
-
+      
+      - name: Install Katib SDK with extra requires
+        shell: bash
+        run: |
+          pip install --prefer-binary -e 'sdk/python/v1beta1[huggingface]'
+      
       - name: Run e2e test with tune API
         uses: ./.github/workflows/template-e2e-test
         with:
diff --git a/.github/workflows/free-up-disk-space/action.yaml b/.github/workflows/free-up-disk-space/action.yaml
@@ -0,0 +1,49 @@
+name: Free-Up Disk Space
+description: Remove Non-Essential Tools And Move Docker Data Directory to /mnt/docker
+
+runs:
+  using: composite
+  steps:
+    # This step is a Workaround to avoid the "No space left on device" error.
+    # ref: https://github.com/actions/runner-images/issues/2840
+    - name: Remove unnecessary files
+      shell: bash
+      run: |
+        echo "Disk usage before cleanup:"
+        df -hT
+
+        sudo rm -rf /usr/share/dotnet
+        sudo rm -rf /opt/ghc
+        sudo rm -rf /usr/local/share/boost
+        sudo rm -rf "$AGENT_TOOLSDIRECTORY"
+        sudo rm -rf /usr/local/lib/android
+        sudo rm -rf /usr/local/share/powershell
+        sudo rm -rf /usr/share/swift
+
+        echo "Disk usage after cleanup:"
+        df -hT
+
+    - name: Prune docker images
+      shell: bash
+      run: |
+        docker image prune -a -f
+        docker system df
+        df -hT
+
+    - name: Move docker data directory
+      shell: bash
+      run: |
+        echo "Stopping docker service ..."
+        sudo systemctl stop docker
+        DOCKER_DEFAULT_ROOT_DIR=/var/lib/docker
+        DOCKER_ROOT_DIR=/mnt/docker
+        echo "Moving ${DOCKER_DEFAULT_ROOT_DIR} -> ${DOCKER_ROOT_DIR}"
+        sudo mv ${DOCKER_DEFAULT_ROOT_DIR} ${DOCKER_ROOT_DIR}
+        echo "Creating symlink ${DOCKER_DEFAULT_ROOT_DIR} -> ${DOCKER_ROOT_DIR}"
+        sudo ln -s ${DOCKER_ROOT_DIR} ${DOCKER_DEFAULT_ROOT_DIR}
+        echo "$(sudo ls -l ${DOCKER_DEFAULT_ROOT_DIR})"
+        echo "Starting docker service ..."
+        sudo systemctl daemon-reload
+        sudo systemctl start docker
+        echo "Docker service status:"
+        sudo systemctl --no-pager -l -o short status docker
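
The action above prints `df -hT` before and after the cleanup so the reclaimed space is visible in the job log. When reproducing the "No space left on device" failure locally, the same before/after comparison can be approximated in Python. This is only a sketch: `human_readable` and `disk_report` are hypothetical helper names and are not part of this commit or of the Katib repository.

```python
import shutil


def human_readable(num_bytes: float) -> str:
    """Format a byte count with binary units, roughly like `df -h`."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        # Stop at the first unit that keeps the value below 1024,
        # or at the largest unit we know about.
        if num_bytes < 1024 or unit == "TiB":
            return f"{num_bytes:.1f}{unit}"
        num_bytes /= 1024


def disk_report(path: str = ".") -> dict:
    """Return total/used/free bytes for the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    return {
        "path": path,
        "total": usage.total,
        "used": usage.used,
        "free": usage.free,
    }


if __name__ == "__main__":
    report = disk_report(".")
    print(
        f"{report['path']}: "
        f"free {human_readable(report['free'])} "
        f"of {human_readable(report['total'])}"
    )
```

Calling `disk_report` once before and once after a cleanup step and subtracting the `free` values gives the number of bytes reclaimed.
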
diff --git a/.github/workflows/template-setup-e2e-test/action.yaml b/.github/workflows/template-setup-e2e-test/action.yaml
@@ -17,19 +17,8 @@ runs:
   steps:
     # This step is a Workaround to avoid the "No space left on device" error.
     # ref: https://github.com/actions/runner-images/issues/2840
-    - name: Remove unnecessary files
-      shell: bash
-      run: |
-        sudo rm -rf /usr/share/dotnet
-        sudo rm -rf /opt/ghc
-        sudo rm -rf "/usr/local/share/boost"
-        sudo rm -rf "$AGENT_TOOLSDIRECTORY"
-        sudo rm -rf /usr/local/lib/android
-        sudo rm -rf /usr/local/share/powershell
-        sudo rm -rf /usr/share/swift
-
-        echo "Disk usage after cleanup:"
-        df -h
+    - name: Free-Up Disk Space
+      uses: ./.github/workflows/free-up-disk-space
 
     - name: Setup kubectl
       uses: azure/setup-kubectl@v4
diff --git a/sdk/python/v1beta1/kubeflow/katib/api/katib_client.py b/sdk/python/v1beta1/kubeflow/katib/api/katib_client.py
@@ -692,8 +692,8 @@ class name in this argument.
                 retain_trials,
                 trial_parameters,
                 resources_per_trial,
-                worker_pod_template_spec,
                 master_pod_template_spec,
+                worker_pod_template_spec,
             )
 
         # Add parameters to the Katib Experiment.
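
The hunk above swaps two positional arguments, which matches the commit note "fix the sequence of arguments in 'trial_template'". Errors of this kind raise no exception when both values have the same type, so they tend to surface only at runtime. The sketch below illustrates the failure mode and how keyword arguments avoid it. `build_trial_template` is a hypothetical stand-in and is not the Katib SDK helper; its signature is an assumption made for illustration.

```python
from typing import Optional


def build_trial_template(
    resources: dict,
    master_pod_template_spec: Optional[str],
    worker_pod_template_spec: Optional[str],
) -> dict:
    """Hypothetical helper that expects the master spec before the worker spec."""
    return {
        "resources": resources,
        "master": master_pod_template_spec,
        "worker": worker_pod_template_spec,
    }


# Positional call with the two specs in the wrong order:
# nothing fails, but the specs end up attached to the wrong replica.
swapped = build_trial_template({"cpu": "2"}, "worker-spec", "master-spec")

# Keyword call: the order at the call site no longer matters.
by_keyword = build_trial_template(
    resources={"cpu": "2"},
    worker_pod_template_spec="worker-spec",
    master_pod_template_spec="master-spec",
)
```

Passing such parameters by keyword, or declaring them keyword-only with a bare `*` in the signature, is one way to make this class of mistake harder to introduce.
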
diff --git a/test/e2e/v1beta1/scripts/gh-actions/run-e2e-tune-api.py b/test/e2e/v1beta1/scripts/gh-actions/run-e2e-tune-api.py
@@ -1,6 +1,8 @@
 import argparse
 import logging
+from pprint import pformat
 
+import kubeflow.katib as katib
 from kubeflow.katib import KatibClient, search
 from kubeflow.katib.types.types import TrainerResources
 from kubernetes import client
@@ -12,7 +14,6 @@
 # The default logging config.
 logging.basicConfig(level=logging.INFO)
 
-
 def run_e2e_experiment_create_by_tune(
     katib_client: KatibClient,
     exp_name: str,
@@ -53,9 +54,8 @@ def objective(parameters):
     verify_experiment_results(katib_client, experiment, exp_name, exp_namespace)
 
     # Print the Experiment and Suggestion.
-    logging.debug(katib_client.get_experiment(exp_name, exp_namespace))
-    logging.debug(katib_client.get_suggestion(exp_name, exp_namespace))
-
+    logging.debug("Experiment:\n%s", pformat(katib_client.get_experiment(exp_name, exp_namespace)))
+    logging.debug("Suggestion:\n%s", pformat(katib_client.get_suggestion(exp_name, exp_namespace)))
 
 def run_e2e_experiment_create_by_tune_pytorchjob(
     katib_client: KatibClient,
@@ -115,9 +115,85 @@ def objective(parameters):
     verify_experiment_results(katib_client, experiment, exp_name, exp_namespace)
 
     # Print the Experiment and Suggestion.
-    logging.debug(katib_client.get_experiment(exp_name, exp_namespace))
-    logging.debug(katib_client.get_suggestion(exp_name, exp_namespace))
+    logging.debug("Experiment:\n%s", pformat(katib_client.get_experiment(exp_name, exp_namespace)))
+    logging.debug("Suggestion:\n%s", pformat(katib_client.get_suggestion(exp_name, exp_namespace)))
+
+def run_e2e_experiment_create_by_tune_with_llm_optimization(
+    katib_client: KatibClient,
+    exp_name: str,
+    exp_namespace: str,
+):
+    import transformers
+    from kubeflow.storage_initializer.hugging_face import (
+        HuggingFaceDatasetParams,
+        HuggingFaceModelParams,
+        HuggingFaceTrainerParams,
+    )
+    from peft import LoraConfig
+
+    # Create Katib Experiment and wait until it is finished.
+    logging.debug("Creating Experiment: {}/{}".format(exp_namespace, exp_name))
+    
+    # Use the test case from fine-tuning API tutorial.
+    # https://www.kubeflow.org/docs/components/training/user-guides/fine-tuning/
+    # Create Katib Experiment.
+    # And Wait until Experiment reaches Succeeded condition.
+    katib_client.tune(
+        name=exp_name,
+        namespace=exp_namespace,
+        # BERT model URI and type of Transformer to train it.
+        model_provider_parameters=HuggingFaceModelParams(
+            model_uri="hf://google-bert/bert-base-cased",
+            transformer_type=transformers.AutoModelForSequenceClassification,
+            num_labels=5,
+        ),
+        # In order to save test time, use 8 samples from Yelp dataset.
+        dataset_provider_parameters=HuggingFaceDatasetParams(
+            repo_id="yelp_review_full",
+            split="train[:8]",
+        ),
+        # Specify HuggingFace Trainer parameters.
+        trainer_parameters=HuggingFaceTrainerParams(
+            training_parameters=transformers.TrainingArguments(
+                output_dir="test_tune_api",
+                save_strategy="no",
+                learning_rate=search.double(min=1e-05, max=5e-05),
+                num_train_epochs=1,
+            ),
+            # Set LoRA config to reduce number of trainable model parameters.
+            lora_config=LoraConfig(
+                r=search.int(min=8, max=32),
+                lora_alpha=8,
+                lora_dropout=0.1,
+                bias="none",
+            ),
+        ),
+        objective_metric_name="train_loss",
+        objective_type="minimize",
+        algorithm_name="random",
+        max_trial_count=1,
+        parallel_trial_count=1,
+        resources_per_trial=katib.TrainerResources(
+            num_workers=1,
+            num_procs_per_worker=1,
+            resources_per_worker={"cpu": "2", "memory": "10G",},
+        ),
+        storage_config={
+            "size": "10Gi",
+            "access_modes": ["ReadWriteOnce"],
+        },
+        retain_trials=True,
+    )
+    experiment = katib_client.wait_for_experiment_condition(
+        exp_name, exp_namespace, timeout=EXPERIMENT_TIMEOUT
+    )
+
+    # Verify the Experiment results.
+    verify_experiment_results(katib_client, experiment, exp_name, exp_namespace)
 
+    # Print the Experiment and Suggestion.
+    logging.debug("Experiment:\n%s", pformat(katib_client.get_experiment(exp_name, exp_namespace)))
+    logging.debug("Suggestion:\n%s", pformat(katib_client.get_suggestion(exp_name, exp_namespace)))
 
 if __name__ == "__main__":
     parser = argparse.ArgumentParser()
@@ -189,3 +265,19 @@ def objective(parameters):
         logging.info("---------------------------------------------------------------")
         logging.info("---------------------------------------------------------------")
         katib_client.delete_experiment(exp_name, exp_namespace)
+
+    exp_name = "tune-example-llm-optimization"
+    exp_namespace = args.namespace
+    try:
+        run_e2e_experiment_create_by_tune_with_llm_optimization(katib_client, exp_name, exp_namespace)
+        logging.info("---------------------------------------------------------------")
+        logging.info(f"E2E is succeeded for Experiment created by tune: {exp_namespace}/{exp_name}")
+    except Exception as e:
+        logging.info("---------------------------------------------------------------")
+        logging.info(f"E2E is failed for Experiment created by tune: {exp_namespace}/{exp_name}")
+        raise e
+    finally:
+        # Delete the Experiment.
+        logging.info("---------------------------------------------------------------")
+        logging.info("---------------------------------------------------------------")
+        katib_client.delete_experiment(exp_name, exp_namespace)
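
The new test declares two search ranges, `learning_rate` in [1e-05, 5e-05] and LoRA `r` in [8, 32], and selects the `random` algorithm. As a rough mental model only, one trial's hyperparameters can be pictured as an independent draw from each range. The sketch below assumes uniform sampling and uses a hypothetical `sample_trial` helper; it does not reproduce how Katib's suggestion service generates assignments.

```python
import random


def sample_trial(rng: random.Random) -> dict:
    """Draw one hypothetical trial assignment from the two ranges in the test."""
    return {
        # Continuous range, mirroring search.double(min=1e-05, max=5e-05).
        "learning_rate": rng.uniform(1e-05, 5e-05),
        # Integer range with both ends included, mirroring search.int(min=8, max=32).
        "r": rng.randint(8, 32),
    }


# A seeded generator keeps the draws reproducible between runs.
rng = random.Random(0)
trials = [sample_trial(rng) for _ in range(3)]

for trial in trials:
    print(trial)
```

With `max_trial_count = 1`, as in the test, only a single assignment of this kind is evaluated, which keeps the e2e run short at the cost of not exercising the search itself.
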
