Add Conformance Program Doc for AutoML and Training WG #2048
Merged: google-oss-prow merged 2 commits into kubeflow:master from andreyvelich:add-conformance-doc on Dec 8, 2022
# Conformance Test for AutoML and Training Working Group

Andrey Velichkevich ([@andreyvelich](https://github.com/andreyvelich))
Johnu George ([@johnugeorge](https://github.com/johnugeorge))
2022-11-21
[Original Google Doc](https://docs.google.com/document/d/1TRUKUY1zCCMdgF-nJ7QtzRwifsoQop0V8UnRo-GWlpI/edit#).

## Motivation

The Kubeflow community needs to design a conformance program so that distributions can
become
[Certified Kubeflow](https://docs.google.com/document/d/1a9ufoe_6DB1eSjpE9eK5nRBoH3ItoSkbPfxRA0AjPIc/edit?resourcekey=0-IRtbQzWfw5L_geRJ7F7GWQ#).
Recently, the Kubeflow Pipelines Working Group (WG) implemented the first version of
[their conformance tests](https://github.com/kubeflow/kubeflow/issues/6485).
We should design an equivalent program for the AutoML and Training WG.

This document is based on the original proposal for
[the Kubeflow Pipelines conformance program](https://docs.google.com/document/d/1_til1HkVBFQ1wCgyUpWuMlKRYI4zP1YPmNxr75mzcps/edit#).

## Objective

The conformance program for the AutoML and Training WG should follow the same goals as the Pipelines program:

- The tests should be fully automated and executable by anyone who has public
access to the Kubeflow repository.
- The test results should be easy for the Kubeflow Conformance Committee to verify.
- The tests should not depend on a specific cloud provider (e.g. AWS or GCP).
- The tests should cover the basic functionality of Katib and the Training Operator.
They will not cover all features.
- The tests are expected to evolve in future versions.
- The tests should have a well-documented and short list of set-up requirements.
- The tests should install and complete in a relatively short period of time
with suggested minimum infrastructure requirements
(e.g. 3 nodes, 24 vCPU, 64 GB RAM, 500 GB Disk).

## Kubeflow Conformance

Initially, Kubeflow conformance will include only the CRD-based tests.
In the future, API-based and UI-based tests may be added. Kubeflow conformance consists
of three categories of tests:

- CRD-based tests

Most Katib and Training Operator functionality is based on Kubernetes CRDs.

**This document will define a design for CRD-based tests for Katib and the Training Operator.**

- API-based tests

Currently, neither Katib nor the Training Operator has an API server that receives
requests from users. However, Katib has the DB Manager component, which is
responsible for writing and reading ML training metrics.

In future versions, we should design a conformance program for the
Katib API-based tests.

- UI-based tests

UI tests are valuable but complex to design, document, and execute. In future
versions, we should design a conformance program for the Katib UI-based tests.

## Design for the CRD-based tests



The design is similar to the KFP conformance program for the API-based tests.

For Katib, tests will be based on
[the `run-e2e-experiment.go` script](https://github.com/kubeflow/katib/blob/570a3e68fff7b963889692d54ee1577fbf65e2ef/test/e2e/v1beta1/hack/gh-actions/run-e2e-experiment.go)
that we run for our e2e tests.

This script will be converted to use the Katib SDK. Tracking issue: https://github.com/kubeflow/katib/issues/2024.

For the Training Operator, tests will be based on [the SDK e2e test](https://github.com/kubeflow/training-operator/tree/05badc6ee8a071400efe9019d8d60fc242818589/sdk/python/test/e2e).

### Test Workflow

All tests will run in the _kf-conformance_ namespace inside a separate container.
This helps avoid environment variance and improves fault tolerance. A driver is required to trigger the deployment and download the results.

- We are going to use
[the unified Makefile](https://github.com/kubeflow/kubeflow/blob/2fa0d3665234125aeb8cebe8fe44f0a5a50791c5/conformance/1.5/Makefile)
for all Kubeflow conformance tests. Distributions (the _driver_ in the diagram)
need to run the following Makefile commands:

```makefile
# Run the conformance program.
run: setup run-katib run-training-operator

# Set up the Kubernetes resources (Kubeflow Profile, RBAC) needed to run the tests.
# Create a temporary folder for the conformance report.
setup:
	kubectl apply -f ./setup.yaml
	mkdir -p /tmp/kf-conformance

# Create the deployments and run the e2e tests for Katib and the Training Operator.
run-katib:
	kubectl apply -f ./katib-conformance.yaml

run-training-operator:
	kubectl apply -f ./training-operator-conformance.yaml

# Download the test deployment results to create a PR for the Kubeflow Conformance Committee.
report:
	./report-conformance.sh

# Clean up created resources and directories.
cleanup:
	kubectl delete -f ./setup.yaml
	kubectl delete -f ./katib-conformance.yaml
	kubectl delete -f ./training-operator-conformance.yaml
	rm -rf /tmp/kf-conformance
```
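
As a rough illustration of the driver role, the sketch below calls the Makefile targets above in order from Python. It is not part of the proposal: the `run_conformance` function, the injectable `runner` parameter, and the decision to always attempt `make cleanup` are assumptions made for this example. It also assumes the tests have finished by the time `make report` runs, since the proposal does not say how the driver waits for completion.

```python
import subprocess
from typing import Callable, List, Sequence


def make_runner(args: Sequence[str]) -> int:
    """Run a command and return its exit code (default runner)."""
    return subprocess.run(list(args), check=False).returncode


def run_conformance(
    runner: Callable[[Sequence[str]], int] = make_runner,
    directory: str = ".",
) -> List[str]:
    """Run `make run` and `make report`, then always attempt `make cleanup`.

    Returns the list of targets that completed successfully.
    Raises RuntimeError on the first target that exits with a non-zero code.
    """
    completed: List[str] = []
    try:
        for target in ("run", "report"):
            code = runner(["make", "-C", directory, target])
            if code != 0:
                raise RuntimeError(f"make {target} failed with exit code {code}")
            completed.append(target)
    finally:
        # Hypothetical policy: clean up even when a step failed.
        runner(["make", "-C", directory, "cleanup"])
    return completed
```

Passing the runner as a parameter keeps the sequencing logic testable without a cluster; a real driver would call `run_conformance()` with the default runner from the directory that contains the Makefile.
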

- The Katib and Training Operator conformance deployments will have the appropriate
RBAC to create, read, and delete Katib Experiments and Training Operator Jobs in the
_kf-conformance_ namespace.

- The distribution should have internet access to download the training datasets
(e.g. MNIST) while running the tests.

- When the job is finished, the script generates output.

For a Katib Experiment, the output should be as follows:

```
Test 1 - passed.
Experiment name: random-search
Experiment status: Experiment has succeeded because max trial count has reached
```

For the Training Operator, the output should be as follows:

```
Test 1 - passed.
TFJob name: tfjob-mnist
TFJob status: TFJob tfjob-mnist is successfully completed.
```

- The above report can be downloaded from the test deployment by running `make report`.

- When all reports have been collected, the distribution creates a PR
to publish the reports and to update the appropriate [Kubeflow Documentation](https://www.kubeflow.org/)
on conformant Kubeflow distributions. The Kubeflow Conformance Committee will
verify the PR and make the distribution
[Certified Kubeflow](https://github.com/kubeflow/community/blob/master/proposals/kubeflow-conformance-program-proposal.md#overview).
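
To illustrate how a report in the format shown earlier might be checked mechanically before it is attached to a PR, here is a minimal Python sketch. The `ReportEntry` type and the `parse_report` and `all_passed` helpers are hypothetical names introduced for this example. The proposal does not define a parser, and the three-line layout is inferred only from the two sample outputs.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class ReportEntry:
    number: int   # the N in "Test N - ..."
    passed: bool  # True when the header says "passed"
    kind: str     # resource kind, e.g. "Experiment" or "TFJob"
    name: str     # resource name, e.g. "random-search"
    status: str   # free-form status message


_HEADER = re.compile(r"^Test (\d+) - (\w+)\.$")
_NAME = re.compile(r"^(\w+) name: (.+)$")
_STATUS = re.compile(r"^(\w+) status: (.+)$")


def parse_report(text: str) -> List[ReportEntry]:
    """Parse consecutive three-line entries; raise ValueError on malformed input."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 3 != 0:
        raise ValueError("report must consist of three-line entries")
    entries: List[ReportEntry] = []
    for i in range(0, len(lines), 3):
        header = _HEADER.match(lines[i])
        name = _NAME.match(lines[i + 1])
        status = _STATUS.match(lines[i + 2])
        if not (header and name and status) or name.group(1) != status.group(1):
            raise ValueError(f"malformed entry starting at: {lines[i]!r}")
        entries.append(
            ReportEntry(
                number=int(header.group(1)),
                passed=header.group(2) == "passed",
                kind=name.group(1),
                name=name.group(2),
                status=status.group(2),
            )
        )
    return entries


def all_passed(entries: List[ReportEntry]) -> bool:
    """True only if there is at least one entry and every entry passed."""
    return bool(entries) and all(e.passed for e in entries)
```

A reviewer could run `all_passed(parse_report(report_text))` on each downloaded report; an empty report is deliberately treated as a failure.
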
I am not sure that we can meet the < 30 minutes requirement.
If we are going to run more than one Katib Experiment in the future, we might need more time. WDYT @johnugeorge?
What about the Pipelines team, @james-jwu?
One more idea - The Katib and Training Operator configuration and tests should make attempts to be integrated with the Pipeline configuration and test configuration. (My point is that we should try to minimize the conformance testing configuration and resource requirements if/when possible).
Pipeline requirement is relatively light. See the below in setup.yaml:
cpu: "2"
memory: 2Gi
requests.storage: "5Gi"
It's been a while since I last ran the Pipeline tests, but they are quite fast (<15 min for sure).
How long do the current Katib and Training tests take to run?
@james-jwu Are the resources a mandatory requirement? We have been running the Katib deployment and tests on GitHub CI, which has a 2-core CPU and 7 GB of memory. Since the allocated resources are a bit tight, we have seen certain runs exceed the 30 min limit. However, with slightly more CPU resources, we can finish within 30 min easily.
Also, Katib's hyperparameter search doesn't care much about how each training step actually goes, so we could set a very small number of epochs or use a very small neural network for the conformance test's experiments.
For the 1st version I think it is okay to require more resources. Jaeyeon's suggestion also sounds great.