Commit 22e4467

Add Conformance Program Doc for AutoML and Training WG
1 parent 0d0e77f commit 22e4467

2 files changed: +140 -0 lines changed

docs/proposals/conformance-test.md

Lines changed: 140 additions & 0 deletions
@@ -0,0 +1,140 @@
# Conformance Test for AutoML and Training Working Group

Andrey Velichkevich ([@andreyvelich](https://github.com/andreyvelich))
Johnu George ([@johnugeorge](https://github.com/johnugeorge))
2022-11-21
[Original Google Doc](https://docs.google.com/document/d/1TRUKUY1zCCMdgF-nJ7QtzRwifsoQop0V8UnRo-GWlpI/edit#).

## Motivation

The Kubeflow community needs to design a conformance program so that distributions can
become
[Certified Kubeflow](https://docs.google.com/document/d/1a9ufoe_6DB1eSjpE9eK5nRBoH3ItoSkbPfxRA0AjPIc/edit?resourcekey=0-IRtbQzWfw5L_geRJ7F7GWQ#).
Recently, the Kubeflow Pipelines Working Group (WG) implemented the first version of
[their conformance tests](https://github.com/kubeflow/kubeflow/issues/6485).
We should design a similar program for the AutoML and Training WG.

This document is based on the original proposal for
[the Kubeflow Pipelines conformance program](https://docs.google.com/document/d/1_til1HkVBFQ1wCgyUpWuMlKRYI4zP1YPmNxr75mzcps/edit#).

## Objective

The conformance program for the AutoML and Training WG should follow the same goals as the Pipelines program:

- The tests should be fully automated and executable by anyone who has public
  access to the Kubeflow repository.
- The test results should be easy for the Kubeflow Conformance Committee to verify.
- The tests should not depend on a specific cloud provider (e.g. AWS or GCP).
- The tests should cover the basic functionality of Katib and the Training Operator.
  They will not cover all features.
- The tests are expected to evolve in future versions.

## Kubeflow Conformance

Kubeflow conformance consists of three categories of tests:

- API-based tests

  Currently, neither Katib nor the Training Operator has an API server that receives
  requests from users. However, Katib has the DB Manager component, which is
  responsible for writing and reading ML training metrics.

  In future versions, we should design a conformance program for the
  Katib API-based tests.

- CRD-based tests

  Most Katib and Training Operator functionality is based on Kubernetes CRDs.

  **This document defines a design for CRD-based tests for Katib and the Training Operator.**

- UI-based tests

  In future versions, we should design a conformance program for the
  Katib UI-based tests.

## Design for the CRD-based tests

![conformance-crd-test](../images/conformance-crd-test.png)

The design is similar to the KFP conformance program for the API-based tests.

For Katib, tests will be based on
[the `run-e2e-experiment.go` script](https://github.com/kubeflow/katib/blob/570a3e68fff7b963889692d54ee1577fbf65e2ef/test/e2e/v1beta1/hack/gh-actions/run-e2e-experiment.go)
that we run for our e2e tests.

This script will be converted to use the Katib SDK. Tracking issue: https://github.com/kubeflow/katib/issues/2024.

For the Training Operator, tests will be based on [the SDK e2e test](https://github.com/kubeflow/training-operator/tree/05badc6ee8a071400efe9019d8d60fc242818589/sdk/python/test/e2e).

### Test Workflow

All tests will run in the _kf-conformance_ namespace inside a separate container.
This helps to avoid environment variance and improves fault tolerance. A driver is required to trigger the deployment and download the results.

- We are going to use
  [the unified Makefile](https://github.com/kubeflow/kubeflow/blob/2fa0d3665234125aeb8cebe8fe44f0a5a50791c5/conformance/1.5/Makefile)
  for all Kubeflow conformance tests. Distributions (_driver_ on the diagram)
  need to run the following Makefile commands:

```makefile
.PHONY: run setup run-katib run-training-operator report cleanup

# Run the conformance program.
run: setup run-katib run-training-operator

# Set up the Kubernetes resources (Kubeflow Profile, RBAC) that the tests need.
# Create a temporary folder for the conformance report.
setup:
	kubectl apply -f ./setup.yaml
	mkdir -p /tmp/kf-conformance

# Create the deployments that run the e2e tests for Katib and the Training Operator.
run-katib:
	kubectl apply -f ./katib-conformance.yaml

run-training-operator:
	kubectl apply -f ./training-operator-conformance.yaml

# Download the test deployment results to create a PR for the Kubeflow Conformance Committee.
report:
	./report-conformance.sh

# Clean up the created resources and directories.
# The test deployments are deleted before setup.yaml because deleting the
# Profile first presumably removes the namespace and makes later deletes fail.
cleanup:
	kubectl delete -f ./katib-conformance.yaml --ignore-not-found
	kubectl delete -f ./training-operator-conformance.yaml --ignore-not-found
	kubectl delete -f ./setup.yaml --ignore-not-found
	rm -rf /tmp/kf-conformance
```
109+
110+
- Katib and Training Operator conformance deployment will have the appropriate
111+
RBAC to Create/Read/Delete Katib Experiment and Training Operator Jobs in the
112+
_kf-conformance_ namespace.

- The distribution should have access to the internet to download the training datasets
  (e.g. MNIST) while running the tests.

- When the job is finished, the script generates output.

For a Katib Experiment, the output should be as follows:

```
Test 1 - passed.
Experiment name: random-search
Experiment status: Experiment has succeeded because max trial count has reached
```

For the Training Operator, the output should be as follows:

```
Test 1 - passed.
TFJob name: tfjob-mnist
TFJob status: TFJob tfjob-mnist is successfully completed.
```
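
The two report formats above share one shape: a `Test N - passed.` header followed by `key: value` lines. The Python sketch below shows how a reviewer might check such a report mechanically. The `ReportEntry` type, the `parse_report` and `all_passed` helpers, and the assumption that a non-passing test prints a word other than `passed` are all hypothetical; the proposal only specifies the passing output.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List

# Matches the header line of each test, e.g. "Test 1 - passed."
# Any outcome word is accepted; only "passed" is defined by the proposal.
HEADER = re.compile(r"^Test (\d+) - (\w+)\.$")


@dataclass
class ReportEntry:
    number: int
    outcome: str
    fields: Dict[str, str] = field(default_factory=dict)


def parse_report(text: str) -> List[ReportEntry]:
    """Split a report into one ReportEntry per "Test N - ..." header."""
    entries: List[ReportEntry] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        header = HEADER.match(line)
        if header:
            entries.append(ReportEntry(int(header.group(1)), header.group(2)))
        elif entries and ": " in line:
            # "Experiment name: random-search" -> {"Experiment name": "random-search"}
            key, value = line.split(": ", 1)
            entries[-1].fields[key] = value
    return entries


def all_passed(entries: List[ReportEntry]) -> bool:
    """A report conforms only if it has at least one test and every test passed."""
    return bool(entries) and all(e.outcome == "passed" for e in entries)


KATIB_REPORT = """\
Test 1 - passed.
Experiment name: random-search
Experiment status: Experiment has succeeded because max trial count has reached
"""

entries = parse_report(KATIB_REPORT)
print(all_passed(entries), entries[0].fields["Experiment name"])
```

Treating an empty report as non-conforming is a deliberate choice in this sketch, so that a test deployment that produced no output is not mistaken for a pass.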

- The above report can be downloaded from the test deployment by running `make report`.

- When all reports have been collected, the distribution creates a PR
  to publish them. The Kubeflow Conformance Committee will verify the reports and
  make the distribution
  [Certified Kubeflow](https://github.com/kubeflow/community/blob/master/proposals/kubeflow-conformance-program-proposal.md#overview).
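
The RBAC mentioned in the workflow above could take roughly the following shape. This Python sketch builds a namespaced `Role` manifest as a plain dictionary. The role names, the exact verb list, and the choice of resources (`experiments`, `trials`, and `suggestions` for Katib, `tfjobs` for the Training Operator) are assumptions for illustration, and the real `setup.yaml` may differ.

```python
import json
from typing import Dict, List

NAMESPACE = "kf-conformance"  # namespace named in this proposal

# Create/Read/Delete mapped onto Kubernetes RBAC verbs (an assumed mapping).
CRD_VERBS: List[str] = ["create", "get", "list", "watch", "delete"]


def conformance_role(name: str, resources: List[str]) -> Dict:
    """Build a Role granting CRD_VERBS on the given kubeflow.org resources."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": name, "namespace": NAMESPACE},
        "rules": [
            {
                "apiGroups": ["kubeflow.org"],
                "resources": resources,
                "verbs": CRD_VERBS,
            }
        ],
    }


# Hypothetical role names; one Role per test deployment.
katib_role = conformance_role(
    "katib-conformance", ["experiments", "trials", "suggestions"]
)
training_role = conformance_role("training-operator-conformance", ["tfjobs"])

# kubectl accepts JSON manifests, so this output can be applied directly.
print(json.dumps(katib_role, indent=2))
```

A `Role` alone grants nothing: each test deployment would also need a ServiceAccount and a RoleBinding that ties the account to the role, which this sketch leaves out.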
