Commit b6f7cfd

[SDK] test: Add e2e test for tune function. (#2399)
* fix(sdk): fix error field metrics_collector in tune function.
* test(sdk): Add e2e tests for tune function.
* test(sdk): add missing field parameters.
* refactor(test/sdk): add run-e2e-tune-api.py.
* test(sdk): delete tune testing code in run-e2e-experiment.
* test(sdk): add blank lines.
* test(sdk): add verbose and temporarily delete e2e-experiment test.
* test(sdk): add namespace_labels.
* test(sdk): add time.sleep(5).
* test(sdk): add error output.
* test(sdk): build random image for tune.
* test(sdk): delete extra debug log.
* refactor(test/sdk): create separate workflow for tune.
* test(sdk): change api to API.
* test(sdk): change the permission of scripts.
* test(sdk): delete exit code & comment image pulling.
* test(sdk): delete image pulling phase.
* test(sdk): refactor workflow file to use template.
* test(sdk): mark experiments and trial-images as not required.
* test(sdk): pass tune-api param to setup-minikube.sh.
* test(sdk): fix err in template-e2e-test.
* test(sdk): add debug logs.
* test(sdk): reorder params and delete logs.

Signed-off-by: Electronic-Waste <[email protected]>
1 parent 51b246f commit b6f7cfd

File tree

9 files changed
+341 -149 lines changed
Lines changed: 34 additions & 0 deletions

@@ -0,0 +1,34 @@
+name: E2E Test with tune API
+
+on:
+  pull_request:
+    paths-ignore:
+      - "pkg/ui/v1beta1/frontend/**"
+
+concurrency:
+  group: ${{ github.workflow }}-${{ github.ref }}
+  cancel-in-progress: true
+
+jobs:
+  e2e:
+    runs-on: ubuntu-22.04
+    timeout-minutes: 120
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+
+      - name: Setup Test Env
+        uses: ./.github/workflows/template-setup-e2e-test
+        with:
+          kubernetes-version: ${{ matrix.kubernetes-version }}
+
+      - name: Run e2e test with tune API
+        uses: ./.github/workflows/template-e2e-test
+        with:
+          tune-api: true
+
+    strategy:
+      fail-fast: false
+      matrix:
+        # Detail: https://hub.docker.com/r/kindest/node
+        kubernetes-version: ["v1.27.11", "v1.28.7", "v1.29.2"]

.github/workflows/template-e2e-test/action.yaml

Lines changed: 15 additions & 4 deletions
@@ -4,15 +4,17 @@ description: Run e2e test using the minikube cluster
 
 inputs:
   experiments:
-    required: true
+    required: false
     description: comma delimited experiment name
+    default: ""
   training-operator:
     required: false
     description: whether to deploy training-operator or not
     default: false
   trial-images:
-    required: true
+    required: false
     description: comma delimited trial image name
+    default: ""
   katib-ui:
     required: true
     description: whether to deploy katib-ui or not
@@ -21,18 +23,27 @@ inputs:
     required: false
     description: mysql or postgres
     default: mysql
+  tune-api:
+    required: true
+    description: whether to execute tune-api test or not
+    default: false
 
 runs:
   using: composite
   steps:
     - name: Setup Minikube Cluster
       shell: bash
-      run: ./test/e2e/v1beta1/scripts/gh-actions/setup-minikube.sh ${{ inputs.katib-ui }} ${{ inputs.trial-images }} ${{ inputs.experiments }}
+      run: ./test/e2e/v1beta1/scripts/gh-actions/setup-minikube.sh ${{ inputs.katib-ui }} ${{ inputs.tune-api }} ${{ inputs.trial-images }} ${{ inputs.experiments }}
 
     - name: Setup Katib
       shell: bash
       run: ./test/e2e/v1beta1/scripts/gh-actions/setup-katib.sh ${{ inputs.katib-ui }} ${{ inputs.training-operator }} ${{ inputs.database-type }}
 
     - name: Run E2E Experiment
       shell: bash
-      run: ./test/e2e/v1beta1/scripts/gh-actions/run-e2e-experiment.sh ${{ inputs.experiments }}
+      run: |
+        if "${{ inputs.tune-api }}"; then
+          ./test/e2e/v1beta1/scripts/gh-actions/run-e2e-tune-api.sh
+        else
+          ./test/e2e/v1beta1/scripts/gh-actions/run-e2e-experiment.sh ${{ inputs.experiments }}
+        fi
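
A note on the branching idiom in the new run step: GitHub Actions expands `${{ inputs.tune-api }}` to the literal string `true` or `false` before bash runs, so the shell executes that word as its `true`/`false` builtin and branches on the exit code, with no string comparison needed. For illustration only, a rough Python sketch of the same dispatch (the script paths are the ones referenced above; the helper function itself is hypothetical):

import subprocess

SCRIPTS = "./test/e2e/v1beta1/scripts/gh-actions"

def run_e2e_step(tune_api: bool, experiments: str = "") -> None:
    # Mirrors the "Run E2E Experiment" step: tune-api selects the tune
    # driver; otherwise the experiment driver receives the
    # comma-delimited experiment names.
    if tune_api:
        subprocess.run([f"{SCRIPTS}/run-e2e-tune-api.sh"], check=True)
    else:
        subprocess.run([f"{SCRIPTS}/run-e2e-experiment.sh", experiments], check=True)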

sdk/python/v1beta1/kubeflow/katib/api/katib_client.py

Lines changed: 1 addition & 1 deletion
@@ -386,7 +386,7 @@ def tune(
 
         # Add metrics collector to the Katib Experiment.
         # Up to now, We only support parameter `kind`, of which default value is `StdOut`, to specify the kind of metrics collector.
-        experiment.spec.metrics_collector = models.V1beta1MetricsCollectorSpec(
+        experiment.spec.metrics_collector_spec = models.V1beta1MetricsCollectorSpec(
             collector=models.V1beta1CollectorSpec(kind=metrics_collector_config["kind"])
         )
 
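
Why the one-word fix matters: the generated `V1beta1ExperimentSpec` model only serializes its declared `metrics_collector_spec` attribute, so assigning to the old `metrics_collector` name attached a stray attribute that was silently dropped when the Experiment was submitted. A minimal sketch of the corrected construction, assuming the kubeflow-katib SDK is installed (`StdOut` is the default kind mentioned in the comment above):

from kubeflow.katib import models

# Build the spec the same way tune() now does; the kind would normally
# come from metrics_collector_config["kind"].
spec = models.V1beta1ExperimentSpec(
    metrics_collector_spec=models.V1beta1MetricsCollectorSpec(
        collector=models.V1beta1CollectorSpec(kind="StdOut")
    )
)
print(spec.metrics_collector_spec.collector.kind)  # StdOut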

test/e2e/v1beta1/scripts/gh-actions/build-load.sh

Lines changed: 10 additions & 3 deletions
@@ -25,9 +25,10 @@ pushd .
 cd "$(dirname "$0")/../../../../.."
 trap popd EXIT
 
-TRIAL_IMAGES=${1:-""}
-EXPERIMENTS=${2:-""}
-DEPLOY_KATIB_UI=${3:-false}
+DEPLOY_KATIB_UI=${1:-false}
+TUNE_API=${2:-false}
+TRIAL_IMAGES=${3:-""}
+EXPERIMENTS=${4:-""}
 
 REGISTRY="docker.io/kubeflowkatib"
 TAG="e2e-test"
@@ -162,6 +163,12 @@ for name in "${TRIAL_IMAGE_ARRAY[@]}"; do
   run "$name" "examples/$VERSION/trial-images/$name/Dockerfile"
 done
 
+# Testing image for tune function
+if "$TUNE_API"; then
+  echo -e "\nPulling and building testing image for tune function..."
+  _build_containers "suggestion-hyperopt" "$CMD_PREFIX/suggestion/hyperopt/$VERSION/Dockerfile"
+fi
+
 echo -e "\nCleanup Build Cache...\n"
 docker buildx prune -f
 
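
Background for the new block: Experiments created through the SDK's `tune` function use the random search algorithm unless told otherwise (hence the "build random image" commit above), and Katib serves that algorithm from the suggestion-hyperopt image, so the tune-api path needs this one extra image besides the core components. A hedged sketch of the call relying on that default; the `algorithm_name` parameter and its `random` default are assumptions about the SDK, and the names below are illustrative:

from kubeflow.katib import KatibClient, search

def objective(parameters):
    # The StdOut collector scrapes metrics printed as "name=value".
    print(f"result={int(parameters['a'])}")

KatibClient().tune(
    name="tune-random-demo",       # hypothetical Experiment name
    namespace="default",           # illustrative namespace
    objective=objective,
    parameters={"a": search.int(min=1, max=5)},
    objective_metric_name="result",
    algorithm_name="random",       # the assumed default, shown explicitly
    max_trial_count=2,
)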

test/e2e/v1beta1/scripts/gh-actions/run-e2e-experiment.py

Lines changed: 1 addition & 138 deletions
@@ -1,13 +1,13 @@
 import argparse
 import logging
-import time
 
 from kubeflow.katib import ApiClient
 from kubeflow.katib import KatibClient
 from kubeflow.katib import models
 from kubeflow.katib.constants import constants
 from kubeflow.katib.utils.utils import FakeResponse
 from kubernetes import client
+from verify import verify_experiment_results
 import yaml
 
 # Experiment timeout is 40 min.
@@ -17,143 +17,6 @@
 logging.basicConfig(level=logging.INFO)
 
 
-def verify_experiment_results(
-    katib_client: KatibClient,
-    experiment: models.V1beta1Experiment,
-    exp_name: str,
-    exp_namespace: str,
-):
-
-    # Get the best objective metric.
-    best_objective_metric = None
-    for metric in experiment.status.current_optimal_trial.observation.metrics:
-        if metric.name == experiment.spec.objective.objective_metric_name:
-            best_objective_metric = metric
-            break
-
-    if best_objective_metric is None:
-        raise Exception(
-            "Unable to get the best metrics for objective: {}. Current Optimal Trial: {}".format(
-                experiment.spec.objective.objective_metric_name,
-                experiment.status.current_optimal_trial,
-            )
-        )
-
-    # Get Experiment Succeeded reason.
-    for c in experiment.status.conditions:
-        if (
-            c.type == constants.EXPERIMENT_CONDITION_SUCCEEDED
-            and c.status == constants.CONDITION_STATUS_TRUE
-        ):
-            succeeded_reason = c.reason
-            break
-
-    trials_completed = experiment.status.trials_succeeded or 0
-    trials_completed += experiment.status.trials_early_stopped or 0
-    max_trial_count = experiment.spec.max_trial_count
-
-    # If Experiment is Succeeded because of Max Trial Reached, all Trials must be completed.
-    if (
-        succeeded_reason == "ExperimentMaxTrialsReached"
-        and trials_completed != max_trial_count
-    ):
-        raise Exception(
-            "All Trials must be Completed. Max Trial count: {}, Experiment status: {}".format(
-                max_trial_count, experiment.status
-            )
-        )
-
-    # If Experiment is Succeeded because of Goal reached, the metrics must be correct.
-    if succeeded_reason == "ExperimentGoalReached" and (
-        (
-            experiment.spec.objective.type == "minimize"
-            and float(best_objective_metric.min) > float(experiment.spec.objective.goal)
-        )
-        or (
-            experiment.spec.objective.type == "maximize"
-            and float(best_objective_metric.max) < float(experiment.spec.objective.goal)
-        )
-    ):
-        raise Exception(
-            "Experiment goal is reached, but metrics are incorrect. "
-            f"Experiment objective: {experiment.spec.objective}. "
-            f"Experiment best objective metric: {best_objective_metric}"
-        )
-
-    # Verify Suggestion's resources. Suggestion name = Experiment name.
-    suggestion = katib_client.get_suggestion(exp_name, exp_namespace)
-
-    # For the Never or FromVolume resume policies Suggestion must be Succeeded.
-    # For the LongRunning resume policy Suggestion must be always Running.
-    for c in suggestion.status.conditions:
-        if (
-            c.type == constants.EXPERIMENT_CONDITION_SUCCEEDED
-            and c.status == constants.CONDITION_STATUS_TRUE
-            and experiment.spec.resume_policy == "LongRunning"
-        ):
-            raise Exception(
-                f"Suggestion is Succeeded while Resume Policy is {experiment.spec.resume_policy}."
-                f"Suggestion conditions: {suggestion.status.conditions}"
-            )
-        elif (
-            c.type == constants.EXPERIMENT_CONDITION_RUNNING
-            and c.status == constants.CONDITION_STATUS_TRUE
-            and experiment.spec.resume_policy != "LongRunning"
-        ):
-            raise Exception(
-                f"Suggestion is Running while Resume Policy is {experiment.spec.resume_policy}."
-                f"Suggestion conditions: {suggestion.status.conditions}"
-            )
-
-    # For Never and FromVolume resume policies verify Suggestion's resources.
-    if (
-        experiment.spec.resume_policy == "Never"
-        or experiment.spec.resume_policy == "FromVolume"
-    ):
-        resource_name = exp_name + "-" + experiment.spec.algorithm.algorithm_name
-
-        # Suggestion's Service and Deployment should be deleted.
-        for i in range(10):
-            try:
-                client.AppsV1Api().read_namespaced_deployment(
-                    resource_name, exp_namespace
-                )
-            except client.ApiException as e:
-                if e.status == 404:
-                    break
-                else:
-                    raise e
-            # Deployment deletion might take some time.
-            time.sleep(1)
-        if i == 10:
-            raise Exception(
-                "Suggestion Deployment is still alive for Resume Policy: {}".format(
-                    experiment.spec.resume_policy
-                )
-            )
-
-        try:
-            client.CoreV1Api().read_namespaced_service(resource_name, exp_namespace)
-        except client.ApiException as e:
-            if e.status != 404:
-                raise e
-        else:
-            raise Exception(
-                "Suggestion Service is still alive for Resume Policy: {}".format(
-                    experiment.spec.resume_policy
-                )
-            )
-
-        # For FromVolume resume policy PVC should not be deleted.
-        if experiment.spec.resume_policy == "FromVolume":
-            try:
-                client.CoreV1Api().read_namespaced_persistent_volume_claim(
-                    resource_name, exp_namespace
-                )
-            except client.ApiException:
-                raise Exception("PVC is deleted for FromVolume Resume Policy")
-
-
 def run_e2e_experiment(
     katib_client: KatibClient,
     experiment: models.V1beta1Experiment,
Lines changed: 97 additions & 0 deletions

@@ -0,0 +1,97 @@
+import argparse
+import logging
+
+from kubeflow.katib import KatibClient
+from kubeflow.katib import search
+from kubernetes import client
+from verify import verify_experiment_results
+
+# Experiment timeout is 40 min.
+EXPERIMENT_TIMEOUT = 60 * 40
+
+# The default logging config.
+logging.basicConfig(level=logging.INFO)
+
+
+def run_e2e_experiment_create_by_tune(
+    katib_client: KatibClient,
+    exp_name: str,
+    exp_namespace: str,
+):
+    # Create Katib Experiment and wait until it is finished.
+    logging.debug("Creating Experiment: {}/{}".format(exp_namespace, exp_name))
+
+    # Use the test case from get-started tutorial.
+    # https://www.kubeflow.org/docs/components/katib/getting-started/#getting-started-with-katib-python-sdk
+    # [1] Create an objective function.
+    def objective(parameters):
+        import time
+        time.sleep(5)
+        result = 4 * int(parameters["a"]) - float(parameters["b"]) ** 2
+        print(f"result={result}")
+
+    # [2] Create hyperparameter search space.
+    parameters = {
+        "a": search.int(min=10, max=20),
+        "b": search.double(min=0.1, max=0.2)
+    }
+
+    # [3] Create Katib Experiment with 4 Trials and 2 CPUs per Trial.
+    # And Wait until Experiment reaches Succeeded condition.
+    katib_client.tune(
+        name=exp_name,
+        namespace=exp_namespace,
+        objective=objective,
+        parameters=parameters,
+        objective_metric_name="result",
+        max_trial_count=4,
+        resources_per_trial={"cpu": "2"},
+    )
+    experiment = katib_client.wait_for_experiment_condition(
+        exp_name, exp_namespace, timeout=EXPERIMENT_TIMEOUT
+    )
+
+    # Verify the Experiment results.
+    verify_experiment_results(katib_client, experiment, exp_name, exp_namespace)
+
+    # Print the Experiment and Suggestion.
+    logging.debug(katib_client.get_experiment(exp_name, exp_namespace))
+    logging.debug(katib_client.get_suggestion(exp_name, exp_namespace))
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "--namespace", type=str, required=True, help="Namespace for the Katib E2E test",
+    )
+    parser.add_argument(
+        "--verbose", action="store_true", help="Verbose output for the Katib E2E test",
+    )
+    args = parser.parse_args()
+
+    if args.verbose:
+        logging.getLogger().setLevel(logging.DEBUG)
+
+    katib_client = KatibClient()
+
+    namespace_labels = client.CoreV1Api().read_namespace(args.namespace).metadata.labels
+    if 'katib.kubeflow.org/metrics-collector-injection' not in namespace_labels:
+        namespace_labels['katib.kubeflow.org/metrics-collector-injection'] = 'enabled'
+        client.CoreV1Api().patch_namespace(args.namespace, {'metadata': {'labels': namespace_labels}})
+
+    # Test with run_e2e_experiment_create_by_tune
+    exp_name = "tune-example"
+    exp_namespace = args.namespace
+    try:
+        run_e2e_experiment_create_by_tune(katib_client, exp_name, exp_namespace)
+        logging.info("---------------------------------------------------------------")
+        logging.info(f"E2E is succeeded for Experiment created by tune: {exp_namespace}/{exp_name}")
+    except Exception as e:
+        logging.info("---------------------------------------------------------------")
+        logging.info(f"E2E is failed for Experiment created by tune: {exp_namespace}/{exp_name}")
+        raise e
+    finally:
+        # Delete the Experiment.
+        logging.info("---------------------------------------------------------------")
+        logging.info("---------------------------------------------------------------")
+        katib_client.delete_experiment(exp_name, exp_namespace)
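
One detail worth calling out in the new driver: Katib's webhook injects the metrics-collector sidecar only into namespaces labeled `katib.kubeflow.org/metrics-collector-injection: enabled`, which is why the script patches the namespace before calling `tune`. A standalone sketch of that patch, assuming a reachable kubeconfig and an illustrative `default` namespace:

from kubernetes import client, config

# Ensure the namespace label that lets Katib inject the
# metrics-collector sidecar into Trial pods.
config.load_kube_config()
client.CoreV1Api().patch_namespace(
    "default",
    {"metadata": {"labels": {
        "katib.kubeflow.org/metrics-collector-injection": "enabled"}}},
)

Judging from the argparse flags above, the wrapper script presumably invokes the driver as `python run-e2e-tune-api.py --namespace <namespace> --verbose`.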

0 commit comments
