Commit 87b7e7d

Add Conformance Program Doc for AutoML and Training WG (#2048)
* Add Conformance Program Doc for AutoML and Training WG * Address Review Comments
1 parent 01b59a4 commit 87b7e7d

2 files changed: 147 additions, 0 deletions.

docs/proposals/conformance-test.md

Lines changed: 147 additions & 0 deletions
@@ -0,0 +1,147 @@
# Conformance Test for AutoML and Training Working Group

Andrey Velichkevich ([@andreyvelich](https://github.com/andreyvelich))
Johnu George ([@johnugeorge](https://github.com/johnugeorge))
2022-11-21
[Original Google Doc](https://docs.google.com/document/d/1TRUKUY1zCCMdgF-nJ7QtzRwifsoQop0V8UnRo-GWlpI/edit#).

## Motivation

The Kubeflow community needs to design a conformance program so that distributions can
become
[Certified Kubeflow](https://docs.google.com/document/d/1a9ufoe_6DB1eSjpE9eK5nRBoH3ItoSkbPfxRA0AjPIc/edit?resourcekey=0-IRtbQzWfw5L_geRJ7F7GWQ#).
Recently, the Kubeflow Pipelines Working Group (WG) implemented the first version of
[their conformance tests](https://github.com/kubeflow/kubeflow/issues/6485).
We should design a similar program for the AutoML and Training WG.

This document is based on the original proposal for
[the Kubeflow Pipelines conformance program](https://docs.google.com/document/d/1_til1HkVBFQ1wCgyUpWuMlKRYI4zP1YPmNxr75mzcps/edit#).

## Objective

The conformance program for the AutoML and Training WG should follow the same goals as the Pipelines program:

- The tests should be fully automated and executable by anyone who has public
  access to the Kubeflow repository.
- The test results should be easy for the Kubeflow Conformance Committee to verify.
- The tests should not depend on a specific cloud provider (e.g. AWS or GCP).
- The tests should cover the basic functionality of Katib and the Training Operator.
  They will not cover all features.
- The tests are expected to evolve in future versions.
- The tests should have a short, well-documented list of set-up requirements.
- The tests should install and complete in a relatively short period of time
  with the suggested minimum infrastructure requirements
  (e.g. 3 nodes, 24 vCPU, 64 GB RAM, 500 GB disk).

## Kubeflow Conformance

Initially, the Kubeflow conformance program will include only the CRD-based tests.
In the future, API-based and UI-based tests may be added. Kubeflow conformance consists
of three categories of tests:

- CRD-based tests

  Most Katib and Training Operator functionality is based on Kubernetes CRDs.

  **This document defines a design for CRD-based tests for Katib and the Training Operator.**

- API-based tests

  Currently, neither Katib nor the Training Operator has an API server that receives
  requests from users. However, Katib has the DB Manager component, which is
  responsible for writing and reading ML training metrics.

  In future versions, we should design a conformance program for the
  Katib API-based tests.

- UI-based tests

  UI tests are valuable but complex to design, document, and execute. In future
  versions, we should design a conformance program for the Katib UI-based tests.

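As a minimal sketch of what a CRD-based test asserts: custom resources such as a Katib Experiment or a TFJob report their progress through `status.conditions`, so a test can create the resource and then inspect that list. The helper name `has_condition` and the sample resource below are hypothetical illustrations, not part of the Katib or Training Operator SDKs. The sketch assumes each condition carries `type` and `status` fields, with `status` set to the string `"True"` when the condition holds.

```python
from typing import Any, Dict


def has_condition(resource: Dict[str, Any], condition_type: str) -> bool:
    """Return True if the custom resource reports the given condition as "True".

    Hypothetical helper for illustration. It assumes the usual Kubernetes
    layout where status.conditions is a list of {"type": ..., "status": ...}.
    """
    conditions = resource.get("status", {}).get("conditions", [])
    return any(
        cond.get("type") == condition_type and cond.get("status") == "True"
        for cond in conditions
    )


# Hand-written sample resource (not real cluster output).
sample_experiment = {
    "apiVersion": "kubeflow.org/v1beta1",
    "kind": "Experiment",
    "metadata": {"name": "random-search", "namespace": "kf-conformance"},
    "status": {
        "conditions": [
            {"type": "Created", "status": "True"},
            {"type": "Running", "status": "False"},
            {"type": "Succeeded", "status": "True"},
        ]
    },
}

print(has_condition(sample_experiment, "Succeeded"))
print(has_condition(sample_experiment, "Failed"))
```

Working on plain dictionaries keeps the check independent of any particular client library, so the same assertion could apply to an Experiment or to a Training Operator job.
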
## Design for the CRD-based tests

![conformance-crd-test](../images/conformance-crd-test.png)

The design is similar to the KFP conformance program for the API-based tests.

For Katib, the tests will be based on
[the `run-e2e-experiment.go` script](https://github.com/kubeflow/katib/blob/570a3e68fff7b963889692d54ee1577fbf65e2ef/test/e2e/v1beta1/hack/gh-actions/run-e2e-experiment.go)
that we run for our e2e tests.

This script will be converted to use the Katib SDK. Tracking issue: https://github.com/kubeflow/katib/issues/2024.

For the Training Operator, the tests will be based on [the SDK e2e test](https://github.com/kubeflow/training-operator/tree/05badc6ee8a071400efe9019d8d60fc242818589/sdk/python/test/e2e).

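An e2e-style test of this kind typically creates a resource and then waits for it to reach a terminal state. The following is a rough sketch of such a wait loop, not the logic of the scripts linked above. The function `wait_for_terminal_condition` and its `fetch` argument are hypothetical: `fetch` stands in for whatever SDK call reads the custom resource, and the fake states replace a real cluster.

```python
import time
from typing import Any, Callable, Dict, Optional


def wait_for_terminal_condition(
    fetch: Callable[[], Dict[str, Any]],
    timeout_seconds: float = 600.0,
    poll_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[Dict[str, Any]]:
    """Poll until the resource reports Succeeded or Failed as "True".

    Returns the terminal condition, or None if the timeout expires first.
    `fetch` is a hypothetical stand-in for an SDK read call.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        resource = fetch()
        for cond in resource.get("status", {}).get("conditions", []):
            is_terminal = cond.get("type") in ("Succeeded", "Failed")
            if is_terminal and cond.get("status") == "True":
                return cond
        if time.monotonic() >= deadline:
            return None
        sleep(poll_seconds)


# Fake in-memory states used instead of a real cluster.
fake_states = iter([
    {"status": {"conditions": [{"type": "Running", "status": "True"}]}},
    {"status": {"conditions": [
        {"type": "Running", "status": "False"},
        {"type": "Succeeded", "status": "True",
         "message": "TFJob tfjob-mnist is successfully completed."},
    ]}},
])

result = wait_for_terminal_condition(
    lambda: next(fake_states), sleep=lambda _seconds: None
)
print(result["message"])
```

Injecting `fetch` and `sleep` keeps the loop testable without a cluster and leaves the choice of SDK call to the test author.
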
### Test Workflow

All tests will run in the _kf-conformance_ namespace inside a separate container.
This helps to avoid environment variance and improves fault tolerance. A driver is required to trigger the deployment and download the results.

- We are going to use
  [the unified Makefile](https://github.com/kubeflow/kubeflow/blob/2fa0d3665234125aeb8cebe8fe44f0a5a50791c5/conformance/1.5/Makefile)
  for all Kubeflow conformance tests. Distributions (the _driver_ on the diagram)
  need to run the following Makefile commands:

  ```makefile
  .PHONY: run setup run-katib run-training-operator report cleanup

  # Run the conformance program.
  run: setup run-katib run-training-operator

  # Set up the Kubernetes resources (Kubeflow Profile, RBAC) that the tests need,
  # and create a temporary folder for the conformance report.
  setup:
  	kubectl apply -f ./setup.yaml
  	mkdir -p /tmp/kf-conformance

  # Create the deployments that run the e2e tests for Katib and the Training Operator.
  run-katib:
  	kubectl apply -f ./katib-conformance.yaml

  run-training-operator:
  	kubectl apply -f ./training-operator-conformance.yaml

  # Download the test deployment results to create a PR for the Kubeflow Conformance Committee.
  report:
  	./report-conformance.sh

  # Clean up the created resources and directories.
  # The test deployments are deleted before the setup resources they depend on.
  cleanup:
  	kubectl delete -f ./katib-conformance.yaml
  	kubectl delete -f ./training-operator-conformance.yaml
  	kubectl delete -f ./setup.yaml
  	rm -rf /tmp/kf-conformance
  ```

116+
- Katib and Training Operator conformance deployment will have the appropriate
117+
RBAC to Create/Read/Delete Katib Experiment and Training Operator Jobs in the
118+
_kf-conformance_ namespace.
119+
120+
- Distribution should have access to the internet to download the training datasets
121+
(e.g. MNIST) while running the tests.
122+
123+
- When the job is finished, the script generates output.
124+
125+
For Katib Experiment the output should be as follows:
126+
127+
```
128+
Test 1 - passed.
129+
Experiment name: random-search
130+
Experiment status: Experiment has succeeded because max trial count has reached
131+
```
132+
133+
For Training Operator the output should be as follows:
134+
135+
```
136+
Test 1 - passed.
137+
TFJob name: tfjob-mnist
138+
TFJob status: TFJob tfjob-mnist is successfully completed.
139+
```
140+
141+
- The above report can be downloaded from the test deployment by running `make report`.
142+
143+
- When all reports have been collected, the distributions are going to create PR
144+
to publish the reports and to update the appropriate [Kubeflow Documentation](https://www.kubeflow.org/)
145+
on conformant Kubeflow distributions. The Kubeflow Conformance Committee will
146+
verify it and make the distribution
147+
[Certified Kubeflow](https://github.com/kubeflow/community/blob/master/proposals/kubeflow-conformance-program-proposal.md#overview).
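
The three-line report format shown above is simple enough to generate and check mechanically. The sketch below is illustrative only: `format_report` and `parse_report` are hypothetical helpers, not part of `report-conformance.sh`, and the `failed` outcome wording is an assumption, since this document only shows the `passed` case.

```python
import re
from typing import Dict, List

# Matches the first line of an entry, e.g. "Test 1 - passed."
# The "failed" alternative is an assumption; only "passed" appears in the proposal.
_HEADER = re.compile(r"^Test (\d+) - (passed|failed)\.$")


def format_report(index: int, kind: str, name: str, status: str, passed: bool) -> str:
    """Render one report entry in the three-line format from the proposal."""
    outcome = "passed" if passed else "failed"
    return "\n".join([
        f"Test {index} - {outcome}.",
        f"{kind} name: {name}",
        f"{kind} status: {status}",
    ])


def parse_report(text: str) -> List[Dict[str, str]]:
    """Parse zero or more three-line report entries into dictionaries."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    entries = []
    for i, line in enumerate(lines):
        match = _HEADER.match(line)
        if not match or i + 2 >= len(lines):
            continue
        kind, _, name = lines[i + 1].partition(" name: ")
        _, _, status = lines[i + 2].partition(" status: ")
        entries.append({
            "test": match.group(1),
            "outcome": match.group(2),
            "kind": kind,
            "name": name,
            "status": status,
        })
    return entries


katib_entry = format_report(
    1, "Experiment", "random-search",
    "Experiment has succeeded because max trial count has reached", True,
)
tfjob_entry = format_report(
    1, "TFJob", "tfjob-mnist",
    "TFJob tfjob-mnist is successfully completed.", True,
)
for entry in parse_report(katib_entry + "\n" + tfjob_entry):
    print(entry["kind"], entry["name"], entry["outcome"])
```

A reviewer could run a parser like this over a submitted report to confirm that every entry is marked as passed before reading the details.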
