Commit 73b8c5c

[GSoC] Add e2e test for tune api with LLM hyperparameter optimization (kubeflow#2420)
* add e2e test for tune api
* upgrade training-operator sdk
* specify the version of training operator sdk
* fix num_labels error and update the version of training operator controller
* check the version of training operator
* debug
* check import path of HuggingFaceModelParams
* update the version of training operator sdk
* update the name of experiment
* add step of checking pod
* check the logs of pod
* add check
* check reason for imagepullbackoff
* revert timeout limit
* fix format
* extend timeout limit
* update training operator sdk version
* check the logs of pod
* rerun tests
* update the function of getting logs
* add the step of describing pod
* check disk space
* change work directory
* change work directory
* increase timeout limit
* check the logs of controller and events
* change work directory
* change work directory
* change work directory
* check the logs of kubelet
* check the logs of kubelet
* increase cpu
* check the logs of training operator
* check the use of resources
* check the logs of container 'pytorch' and 'storage_initializer'
* fix error of checking use of resources
* add other checks to find the error reason
* set 'storage_config'
* reduce the number of tests
* Check container runtime logs
* set the driver of minikube as docker
* set the driver of minikube to none
* check logs of pod
* check memory usage
* increase 'termination_grace_period_seconds' in podspec
* fix annotations error
* restart docker
* delete restarting docker
* use original docker data directory
* update installation of Katib SDK with extra requires
* test trainer image built with cpu
* add action of free up disk space (including move docker data directory)
* delete unnecessary checks and update the part of fetching pod description and logs
* delete fetching pod logs
* add blank line at the end of free-up-disk-space yaml file
* update experiment name
* update test function name to be consistent with experiment name
* move import statements inside the function
* apply pprint for the logging output
* update experiment names
* fix format
* fix format
* fix the sequence of arguments in 'trial_template'
* test example in user guide
* fix access token error
* fix the error of setup
* fix the error of setup
* reverse back
* fix format
* fix format

---------

Signed-off-by: helenxie-bit <[email protected]>
1 parent 5cd9592 commit 73b8c5c

5 files changed: +156 -21 lines changed

.github/workflows/e2e-test-tune-api.yaml

Lines changed: 6 additions & 1 deletion
@@ -21,7 +21,12 @@ jobs:
         uses: ./.github/workflows/template-setup-e2e-test
         with:
           kubernetes-version: ${{ matrix.kubernetes-version }}
-
+
+      - name: Install Katib SDK with extra requires
+        shell: bash
+        run: |
+          pip install --prefer-binary -e 'sdk/python/v1beta1[huggingface]'
+
       - name: Run e2e test with tune API
         uses: ./.github/workflows/template-e2e-test
         with:

Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
+name: Free-Up Disk Space
+description: Remove Non-Essential Tools And Move Docker Data Directory to /mnt/docker
+
+runs:
+  using: composite
+  steps:
+    # This step is a Workaround to avoid the "No space left on device" error.
+    # ref: https://github.com/actions/runner-images/issues/2840
+    - name: Remove unnecessary files
+      shell: bash
+      run: |
+        echo "Disk usage before cleanup:"
+        df -hT
+
+        sudo rm -rf /usr/share/dotnet
+        sudo rm -rf /opt/ghc
+        sudo rm -rf /usr/local/share/boost
+        sudo rm -rf "$AGENT_TOOLSDIRECTORY"
+        sudo rm -rf /usr/local/lib/android
+        sudo rm -rf /usr/local/share/powershell
+        sudo rm -rf /usr/share/swift
+
+        echo "Disk usage after cleanup:"
+        df -hT
+
+    - name: Prune docker images
+      shell: bash
+      run: |
+        docker image prune -a -f
+        docker system df
+        df -hT
+
+    - name: Move docker data directory
+      shell: bash
+      run: |
+        echo "Stopping docker service ..."
+        sudo systemctl stop docker
+        DOCKER_DEFAULT_ROOT_DIR=/var/lib/docker
+        DOCKER_ROOT_DIR=/mnt/docker
+        echo "Moving ${DOCKER_DEFAULT_ROOT_DIR} -> ${DOCKER_ROOT_DIR}"
+        sudo mv ${DOCKER_DEFAULT_ROOT_DIR} ${DOCKER_ROOT_DIR}
+        echo "Creating symlink ${DOCKER_DEFAULT_ROOT_DIR} -> ${DOCKER_ROOT_DIR}"
+        sudo ln -s ${DOCKER_ROOT_DIR} ${DOCKER_DEFAULT_ROOT_DIR}
+        echo "$(sudo ls -l ${DOCKER_DEFAULT_ROOT_DIR})"
+        echo "Starting docker service ..."
+        sudo systemctl daemon-reload
+        sudo systemctl start docker
+        echo "Docker service status:"
+        sudo systemctl --no-pager -l -o short status docker

.github/workflows/template-setup-e2e-test/action.yaml

Lines changed: 2 additions & 13 deletions
@@ -17,19 +17,8 @@ runs:
   steps:
     # This step is a Workaround to avoid the "No space left on device" error.
     # ref: https://github.com/actions/runner-images/issues/2840
-    - name: Remove unnecessary files
-      shell: bash
-      run: |
-        sudo rm -rf /usr/share/dotnet
-        sudo rm -rf /opt/ghc
-        sudo rm -rf "/usr/local/share/boost"
-        sudo rm -rf "$AGENT_TOOLSDIRECTORY"
-        sudo rm -rf /usr/local/lib/android
-        sudo rm -rf /usr/local/share/powershell
-        sudo rm -rf /usr/share/swift
-
-        echo "Disk usage after cleanup:"
-        df -h
+    - name: Free-Up Disk Space
+      uses: ./.github/workflows/free-up-disk-space
 
     - name: Setup kubectl
       uses: azure/setup-kubectl@v4

sdk/python/v1beta1/kubeflow/katib/api/katib_client.py

Lines changed: 1 addition & 1 deletion
@@ -692,8 +692,8 @@ class name in this argument.
             retain_trials,
             trial_parameters,
             resources_per_trial,
-            worker_pod_template_spec,
             master_pod_template_spec,
+            worker_pod_template_spec,
         )
 
         # Add parameters to the Katib Experiment.
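
The hunk above swaps the order of two positional arguments. The sketch below illustrates why that kind of mix-up is easy to miss: when both arguments have the same shape, a swapped positional call raises no error. `build_trial_template` is a hypothetical stand-in written for this note, not the Katib helper, and its parameter order is an assumption.

```python
# Hypothetical stand-in for a helper that accepts two pod template specs.
# The name and the parameter order are assumptions for illustration only.
def build_trial_template(master_pod_template_spec, worker_pod_template_spec):
    return {
        "Master": master_pod_template_spec,
        "Worker": worker_pod_template_spec,
    }


master = {"role": "master"}
worker = {"role": "worker"}

# Positional call with the two specs swapped: nothing fails,
# but each spec lands under the wrong replica type.
swapped = build_trial_template(worker, master)

# Keyword call: the order at the call site no longer matters.
by_keyword = build_trial_template(
    worker_pod_template_spec=worker,
    master_pod_template_spec=master,
)

print(swapped["Master"]["role"])
print(by_keyword["Master"]["role"])
```

Passing such arguments by keyword is one way to make a call site robust against reordering of the callee's signature.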

test/e2e/v1beta1/scripts/gh-actions/run-e2e-tune-api.py

Lines changed: 98 additions & 6 deletions
@@ -1,6 +1,8 @@
 import argparse
 import logging
+from pprint import pformat
 
+import kubeflow.katib as katib
 from kubeflow.katib import KatibClient, search
 from kubeflow.katib.types.types import TrainerResources
 from kubernetes import client
@@ -12,7 +14,6 @@
 # The default logging config.
 logging.basicConfig(level=logging.INFO)
 
-
 def run_e2e_experiment_create_by_tune(
     katib_client: KatibClient,
     exp_name: str,
@@ -53,9 +54,8 @@ def objective(parameters):
     verify_experiment_results(katib_client, experiment, exp_name, exp_namespace)
 
     # Print the Experiment and Suggestion.
-    logging.debug(katib_client.get_experiment(exp_name, exp_namespace))
-    logging.debug(katib_client.get_suggestion(exp_name, exp_namespace))
-
+    logging.debug("Experiment:\n%s", pformat(katib_client.get_experiment(exp_name, exp_namespace)))
+    logging.debug("Suggestion:\n%s", pformat(katib_client.get_suggestion(exp_name, exp_namespace)))
 
 def run_e2e_experiment_create_by_tune_pytorchjob(
     katib_client: KatibClient,
@@ -115,9 +115,85 @@ def objective(parameters):
     verify_experiment_results(katib_client, experiment, exp_name, exp_namespace)
 
     # Print the Experiment and Suggestion.
-    logging.debug(katib_client.get_experiment(exp_name, exp_namespace))
-    logging.debug(katib_client.get_suggestion(exp_name, exp_namespace))
+    logging.debug("Experiment:\n%s", pformat(katib_client.get_experiment(exp_name, exp_namespace)))
+    logging.debug("Suggestion:\n%s", pformat(katib_client.get_suggestion(exp_name, exp_namespace)))
+
+def run_e2e_experiment_create_by_tune_with_llm_optimization(
+    katib_client: KatibClient,
+    exp_name: str,
+    exp_namespace: str,
+):
+    import transformers
+    from kubeflow.storage_initializer.hugging_face import (
+        HuggingFaceDatasetParams,
+        HuggingFaceModelParams,
+        HuggingFaceTrainerParams,
+    )
+    from peft import LoraConfig
+
+    # Create Katib Experiment and wait until it is finished.
+    logging.debug("Creating Experiment: {}/{}".format(exp_namespace, exp_name))
+
+    # Use the test case from fine-tuning API tutorial.
+    # https://www.kubeflow.org/docs/components/training/user-guides/fine-tuning/
+    # Create Katib Experiment.
+    # And Wait until Experiment reaches Succeeded condition.
+    katib_client.tune(
+        name=exp_name,
+        namespace=exp_namespace,
+        # BERT model URI and type of Transformer to train it.
+        model_provider_parameters=HuggingFaceModelParams(
+            model_uri="hf://google-bert/bert-base-cased",
+            transformer_type=transformers.AutoModelForSequenceClassification,
+            num_labels=5,
+        ),
+        # In order to save test time, use 8 samples from Yelp dataset.
+        dataset_provider_parameters=HuggingFaceDatasetParams(
+            repo_id="yelp_review_full",
+            split="train[:8]",
+        ),
+        # Specify HuggingFace Trainer parameters.
+        trainer_parameters=HuggingFaceTrainerParams(
+            training_parameters=transformers.TrainingArguments(
+                output_dir="test_tune_api",
+                save_strategy="no",
+                learning_rate = search.double(min=1e-05, max=5e-05),
+                num_train_epochs=1,
+            ),
+            # Set LoRA config to reduce number of trainable model parameters.
+            lora_config=LoraConfig(
+                r = search.int(min=8, max=32),
+                lora_alpha=8,
+                lora_dropout=0.1,
+                bias="none",
+            ),
+        ),
+        objective_metric_name = "train_loss",
+        objective_type = "minimize",
+        algorithm_name = "random",
+        max_trial_count = 1,
+        parallel_trial_count = 1,
+        resources_per_trial=katib.TrainerResources(
+            num_workers=1,
+            num_procs_per_worker=1,
+            resources_per_worker={"cpu": "2", "memory": "10G",},
+        ),
+        storage_config={
+            "size": "10Gi",
+            "access_modes": ["ReadWriteOnce"],
+        },
+        retain_trials=True,
+    )
+    experiment = katib_client.wait_for_experiment_condition(
+        exp_name, exp_namespace, timeout=EXPERIMENT_TIMEOUT
+    )
+
+    # Verify the Experiment results.
+    verify_experiment_results(katib_client, experiment, exp_name, exp_namespace)
 
+    # Print the Experiment and Suggestion.
+    logging.debug("Experiment:\n%s", pformat(katib_client.get_experiment(exp_name, exp_namespace)))
+    logging.debug("Suggestion:\n%s", pformat(katib_client.get_suggestion(exp_name, exp_namespace)))
 
 if __name__ == "__main__":
     parser = argparse.ArgumentParser()
@@ -189,3 +265,19 @@ def objective(parameters):
         logging.info("---------------------------------------------------------------")
         logging.info("---------------------------------------------------------------")
         katib_client.delete_experiment(exp_name, exp_namespace)
+
+    exp_name = "tune-example-llm-optimization"
+    exp_namespace = args.namespace
+    try:
+        run_e2e_experiment_create_by_tune_with_llm_optimization(katib_client, exp_name, exp_namespace)
+        logging.info("---------------------------------------------------------------")
+        logging.info(f"E2E is succeeded for Experiment created by tune: {exp_namespace}/{exp_name}")
+    except Exception as e:
+        logging.info("---------------------------------------------------------------")
+        logging.info(f"E2E is failed for Experiment created by tune: {exp_namespace}/{exp_name}")
+        raise e
+    finally:
+        # Delete the Experiment.
+        logging.info("---------------------------------------------------------------")
+        logging.info("---------------------------------------------------------------")
+        katib_client.delete_experiment(exp_name, exp_namespace)
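
The script diff switches its debug output to `logging.debug("...%s", pformat(...))`. The sketch below shows that pattern in isolation with only the standard library. The `experiment` dictionary is a made-up stand-in for the object returned by the client, and the `isEnabledFor` guard is an optional addition that is not part of the commit.

```python
import logging
from pprint import pformat

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Made-up nested data standing in for an Experiment object (assumption).
experiment = {
    "metadata": {"name": "tune-example-llm-optimization", "namespace": "default"},
    "spec": {
        "max_trial_count": 1,
        "parallel_trial_count": 1,
        "objective": {"type": "minimize", "objective_metric_name": "train_loss"},
    },
}


def format_for_log(obj) -> str:
    # pformat wraps long nested structures across several lines.
    return pformat(obj)


# The "%s" substitution is deferred until a handler emits the record,
# but the pformat call itself still runs because it is an argument.
logger.debug("Experiment:\n%s", format_for_log(experiment))

# Optional guard: skip the formatting work when DEBUG is disabled.
if logger.isEnabledFor(logging.DEBUG):
    logger.debug("Experiment:\n%s", format_for_log(experiment))

print(format_for_log(experiment))
```

With the level set to `INFO` as in the script, neither `debug` call emits anything; lowering the level to `DEBUG` is what makes the pretty-printed output appear.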
