Commit c9528e7

Authored by ram4444, andreyvelich, and tenzen-y
Adding out of the box support to TrainJob (kubeflow#2560)
* Out-of-the-box support TrainJob
  Signed-off-by: Ram Lau <[email protected]>
* Example for Pytorch Distributed
  Signed-off-by: Ram Lau <[email protected]>
* Update examples/v1beta1/kubeflow-training-operator/trainjob-pytorch.yaml
  Co-authored-by: Andrey Velichkevich <[email protected]>
  Signed-off-by: Ram Lau <[email protected]>
* Create folder for Trainer as suggested
  Signed-off-by: Ram Lau <[email protected]>
* Move the example of trainjob to the new folder
  Signed-off-by: Ram Lau <[email protected]>
* Ref the primaryContainerName to that of ClusterTrainingRuntime
  Signed-off-by: Ram Lau <[email protected]>
* tenzen-y steps down from Katib approver role (kubeflow#2561)
  Signed-off-by: Yuki Iwai <[email protected]>
  Signed-off-by: Ram Lau <[email protected]>
* Set Default value for TrainJob Success, Failure Condition and PrimaryPodLabels in the trial Template
  Signed-off-by: Ram Lau <[email protected]>
* Enhance Handling for default value of Success, Fail Cond & Pod Label
  Signed-off-by: Ram Lau <[email protected]>
* Bug fix for default value condition
  Signed-off-by: Ram Lau <[email protected]>
* code format by hack/update-gofmt.sh
  Signed-off-by: Ram Lau <[email protected]>
* add TrainJob trial Resources to cert manager config
  Signed-off-by: Ram Lau <[email protected]>
* add trainjob to controller rbac
  Signed-off-by: Ram Lau <[email protected]>
* Grant JobSet permission to Katib controller
  Signed-off-by: Andrey Velichkevich <[email protected]>
* Remove create/delete RBAC for TrainJob
  Signed-off-by: Andrey Velichkevich <[email protected]>
* Fix docker build with libpcre2
  Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Ram Lau <[email protected]>
Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Co-authored-by: Andrey Velichkevich <[email protected]>
Co-authored-by: Yuki Iwai <[email protected]>
1 parent fe7a35d commit c9528e7

9 files changed: +98 -7 lines changed


cmd/metricscollector/v1beta1/tfevent-metricscollector/Dockerfile

Lines changed: 1 addition & 2 deletions
@@ -11,8 +11,7 @@ ADD ./${METRICS_COLLECTOR_DIR}/ ${TARGET_DIR}/${METRICS_COLLECTOR_DIR}/
 WORKDIR ${TARGET_DIR}/${METRICS_COLLECTOR_DIR}
 
 RUN if [ "${TARGETARCH}" = "arm64" ]; then \
-    apt-get -y update && \
-    apt-get -y install gfortran libpcre3 libpcre3-dev && \
+    apt-get -y update && apt-get -y install gfortran libpcre2-dev && \
     apt-get clean && \
     rm -rf /var/lib/apt/lists/*; \
     fi
Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
+---
+apiVersion: kubeflow.org/v1beta1
+kind: Experiment
+metadata:
+  namespace: kubeflow
+  name: torch-distributed-example
+spec:
+  parallelTrialCount: 3
+  maxTrialCount: 12
+  maxFailedTrialCount: 3
+  objective:
+    type: minimize
+    goal: 0.001
+    objectiveMetricName: loss
+  algorithm:
+    algorithmName: random
+  parameters:
+    - name: lr
+      parameterType: double
+      feasibleSpace:
+        min: "0.01"
+        max: "0.05"
+    - name: momentum
+      parameterType: double
+      feasibleSpace:
+        min: "0.5"
+        max: "0.9"
+  trialTemplate:
+    primaryContainerName: node
+    trialParameters:
+      - name: learningRate
+        description: Learning rate for the training model
+        reference: lr
+      - name: momentum
+        description: Momentum for the training model
+        reference: momentum
+    trialSpec:
+      apiVersion: trainer.kubeflow.org/v1alpha1
+      kind: TrainJob
+      spec:
+        runtimeRef:
+          name: torch-distributed
+        trainer:
+          numNodes: 2
+          image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
+          command:
+            - "python3"
+            - "/opt/pytorch-mnist/mnist.py"
+            - "--epochs=1"
+            - "--lr=${trialParameters.learningRate}"
+            - "--momentum=${trialParameters.momentum}"
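
The trial template above refers to hyperparameters through `${trialParameters.<name>}` placeholders in the trainer command. As a rough illustration of what that substitution amounts to, here is a minimal Go sketch. The helper `renderCommand` and the sample values are hypothetical and are not taken from Katib's code; the sketch only assumes plain string replacement of each placeholder.

```go
package main

import (
	"fmt"
	"strings"
)

// renderCommand is a hypothetical helper (not Katib's implementation).
// It replaces every ${trialParameters.<name>} placeholder in the command
// with the value assigned to that trial parameter.
func renderCommand(command []string, values map[string]string) []string {
	rendered := make([]string, len(command))
	for i, arg := range command {
		for name, value := range values {
			arg = strings.ReplaceAll(arg, "${trialParameters."+name+"}", value)
		}
		rendered[i] = arg
	}
	return rendered
}

func main() {
	// Command copied from the example Experiment.
	command := []string{
		"python3",
		"/opt/pytorch-mnist/mnist.py",
		"--epochs=1",
		"--lr=${trialParameters.learningRate}",
		"--momentum=${trialParameters.momentum}",
	}
	// Made-up values inside the feasible space declared in the Experiment.
	values := map[string]string{"learningRate": "0.03", "momentum": "0.7"}

	fmt.Println(strings.Join(renderCommand(command, values), " "))
}
```

Note that the placeholder uses the trial parameter name (`learningRate`), while the search space entry it points to is named by `reference` (`lr`).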

examples/v1beta1/trial-images/enas-cnn-cifar10/Dockerfile.cpu

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ WORKDIR ${TARGET_DIR}
 
 RUN if [ "${TARGETARCH}" = "arm64" ]; then \
     apt-get -y update && \
-    apt-get -y install gfortran libpcre3 libpcre3-dev && \
+    apt-get -y install gfortran libpcre2-dev && \
     apt-get clean && \
     rm -rf /var/lib/apt/lists/*; \
     fi

examples/v1beta1/trial-images/tf-mnist-with-summaries/Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ WORKDIR /opt/tf-mnist-with-summaries
 
 RUN if [ "${TARGETARCH}" = "arm64" ]; then \
     apt-get -y update && \
-    apt-get -y install gfortran libpcre3 libpcre3-dev && \
+    apt-get -y install gfortran libpcre2-dev && \
     apt-get clean && \
     rm -rf /var/lib/apt/lists/*; \
     fi

manifests/v1beta1/components/controller/rbac.yaml

Lines changed: 16 additions & 0 deletions
@@ -90,6 +90,22 @@ rules:
       - "watch"
       - "create"
       - "delete"
+  - apiGroups:
+      - jobset.x-k8s.io
+    resources:
+      - jobsets
+    verbs:
+      - "get"
+      - "list"
+      - "watch"
+  - apiGroups:
+      - trainer.kubeflow.org
+    resources:
+      - trainjobs
+    verbs:
+      - "get"
+      - "list"
+      - "watch"
   - apiGroups:
       - kubeflow.org
     resources:

manifests/v1beta1/installs/katib-cert-manager/katib-config.yaml

Lines changed: 1 addition & 0 deletions
@@ -5,6 +5,7 @@ init:
   controller:
     webhookPort: 8443
     trialResources:
+      - TrainJob.v1alpha1.trainer.kubeflow.org
       - Job.v1.batch
       - TFJob.v1.kubeflow.org
       - PyTorchJob.v1.kubeflow.org

manifests/v1beta1/installs/katib-standalone/katib-config.yaml

Lines changed: 1 addition & 0 deletions
@@ -6,6 +6,7 @@ init:
   controller:
     webhookPort: 8443
     trialResources:
+      - TrainJob.v1alpha1.trainer.kubeflow.org
       - Job.v1.batch
       - TFJob.v1.kubeflow.org
       - PyTorchJob.v1.kubeflow.org
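
Each `trialResources` entry in the two config files appears to follow a `Kind.version.group` layout, which matches the `apiVersion: trainer.kubeflow.org/v1alpha1` and `kind: TrainJob` used in the example trial spec. The following Go sketch shows one way to split such an entry. The type `gvk` and the function `parseTrialResource` are hypothetical names for illustration, and the layout itself is an assumption inferred from the entries in this diff.

```go
package main

import (
	"fmt"
	"strings"
)

// gvk is a hypothetical stand-in for a group/version/kind triple.
type gvk struct {
	Group   string
	Version string
	Kind    string
}

// parseTrialResource splits an entry assumed to be laid out as
// Kind.version.group. Only the first two dots are separators, so a
// group that itself contains dots (trainer.kubeflow.org) stays intact.
func parseTrialResource(entry string) (gvk, error) {
	parts := strings.SplitN(entry, ".", 3)
	if len(parts) != 3 || parts[0] == "" || parts[1] == "" {
		return gvk{}, fmt.Errorf("expected Kind.version.group, got %q", entry)
	}
	return gvk{Kind: parts[0], Version: parts[1], Group: parts[2]}, nil
}

func main() {
	entries := []string{
		"TrainJob.v1alpha1.trainer.kubeflow.org",
		"Job.v1.batch",
	}
	for _, entry := range entries {
		parsed, err := parseTrialResource(entry)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("kind=%s apiVersion=%s/%s\n", parsed.Kind, parsed.Group, parsed.Version)
	}
}
```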

pkg/apis/controller/experiments/v1beta1/constants.go

Lines changed: 10 additions & 0 deletions
@@ -34,17 +34,27 @@ const (
 
 	// DefaultKubeflowJobFailureCondition is the default value of spec.trialTemplate.failureCondition for Kubeflow Training Job.
 	DefaultKubeflowJobFailureCondition = "status.conditions.#(type==\"Failed\")#|#(status==\"True\")#"
+
+	// DefaultTrainJobSuccessCondition is the default value of spec.trialTemplate.successCondition for Training Operator Job.
+	DefaultTrainJobSuccessCondition = "status.conditions.#(type==\"Complete\")#|#(status==\"True\")#"
+
+	// DefaultTrainJobFailureCondition is the default value of spec.trialTemplate.failureCondition for Training Operator Job.
+	DefaultTrainJobFailureCondition = "status.conditions.#(type==\"Failed\")#|#(status==\"True\")#"
 )
 
 var (
 	// DefaultKubeflowJobPrimaryPodLabels is the default value of spec.trialTemplate.primaryPodLabels for Kubeflow Training Job.
 	DefaultKubeflowJobPrimaryPodLabels = map[string]string{"training.kubeflow.org/job-role": "master"}
 
+	// DefaultKubeflowJobPrimaryPodLabels is the default value of spec.trialTemplate.primaryPodLabels for Training Operator Job.
+	DefaultTrainJobPrimaryPodLabels = map[string]string{"jobset.sigs.k8s.io/replicatedjob-name": "node", "batch.kubernetes.io/job-completion-index": "0"}
+
 	// KubeflowJobKinds is the list of Kubeflow Training Job kinds.
 	KubeflowJobKinds = map[string]bool{
 		"TFJob":      true,
 		"PyTorchJob": true,
 		"XGBoostJob": true,
 		"MPIJob":     true,
+		"TrainJob":   true,
 	}
 )
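
The new default conditions are GJSON-style path expressions: `status.conditions.#(type=="Complete")#|#(status=="True")#` first keeps the conditions whose `type` is `Complete`, then keeps those whose `status` is `True`. The sketch below approximates that check with only the Go standard library rather than a GJSON library, so it is an illustration of the intent and not the evaluation Katib performs. The function `hasTrueCondition` and the sample status JSON are made up for this example, and it assumes a non-empty match means the condition is met.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// condition mirrors the two fields the default expressions look at.
type condition struct {
	Type   string `json:"type"`
	Status string `json:"status"`
}

// object models only the status.conditions part of a resource.
type object struct {
	Status struct {
		Conditions []condition `json:"conditions"`
	} `json:"status"`
}

// hasTrueCondition is a hypothetical stdlib approximation of the
// expression status.conditions.#(type=="<condType>")#|#(status=="True")#:
// it reports whether any condition has the given type and status "True".
func hasTrueCondition(raw []byte, condType string) (bool, error) {
	var obj object
	if err := json.Unmarshal(raw, &obj); err != nil {
		return false, err
	}
	for _, c := range obj.Status.Conditions {
		if c.Type == condType && c.Status == "True" {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// Made-up TrainJob status used only for illustration.
	raw := []byte(`{"status":{"conditions":[
		{"type":"Suspended","status":"False"},
		{"type":"Complete","status":"True"}]}}`)

	succeeded, err := hasTrueCondition(raw, "Complete")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	failed, _ := hasTrueCondition(raw, "Failed")
	fmt.Println("succeeded:", succeeded, "failed:", failed)
}
```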

pkg/apis/controller/experiments/v1beta1/experiment_defaults.go

Lines changed: 16 additions & 3 deletions
@@ -109,14 +109,27 @@ func (e *Experiment) setDefaultTrialTemplate() {
 			}
 		} else if _, ok := KubeflowJobKinds[jobKind]; ok {
 			if t.SuccessCondition == "" {
-				t.SuccessCondition = DefaultKubeflowJobSuccessCondition
+				if jobKind == "TrainJob" {
+					t.SuccessCondition = DefaultTrainJobSuccessCondition
+				} else {
+					t.SuccessCondition = DefaultKubeflowJobSuccessCondition
+				}
 			}
 			if t.FailureCondition == "" {
-				t.FailureCondition = DefaultKubeflowJobFailureCondition
+				if jobKind == "TrainJob" {
+					t.FailureCondition = DefaultTrainJobFailureCondition
+				} else {
+					t.FailureCondition = DefaultKubeflowJobFailureCondition
+				}
 			}
 			// For Kubeflow Job also set default PrimaryPodLabels
 			if len(t.PrimaryPodLabels) == 0 {
-				t.PrimaryPodLabels = DefaultKubeflowJobPrimaryPodLabels
+				if jobKind == "TrainJob" {
+					t.PrimaryPodLabels = DefaultTrainJobPrimaryPodLabels
+				} else {
+					t.PrimaryPodLabels = DefaultKubeflowJobPrimaryPodLabels
+				}
+
 			}
 		}
 	}
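
The defaulting change can be tried in isolation with a reduced model of the trial template. The sketch below is a standalone approximation, not the Katib source: `trialTemplate` and `setDefaults` are hypothetical simplified names, the constant values are copied from the constants.go hunk in this commit, and the success condition is left out because the legacy default's value does not appear in this diff.

```go
package main

import "fmt"

// Values copied from the constants.go hunk in this commit.
const (
	defaultKubeflowJobFailureCondition = `status.conditions.#(type=="Failed")#|#(status=="True")#`
	defaultTrainJobFailureCondition    = `status.conditions.#(type=="Failed")#|#(status=="True")#`
)

var (
	defaultKubeflowJobPrimaryPodLabels = map[string]string{
		"training.kubeflow.org/job-role": "master",
	}
	defaultTrainJobPrimaryPodLabels = map[string]string{
		"jobset.sigs.k8s.io/replicatedjob-name":   "node",
		"batch.kubernetes.io/job-completion-index": "0",
	}
)

// trialTemplate is a hypothetical, reduced version of the real type.
type trialTemplate struct {
	FailureCondition string
	PrimaryPodLabels map[string]string
}

// setDefaults fills only the fields the user left empty and picks the
// TrainJob-specific defaults when jobKind is "TrainJob".
func setDefaults(t *trialTemplate, jobKind string) {
	isTrainJob := jobKind == "TrainJob"
	if t.FailureCondition == "" {
		if isTrainJob {
			t.FailureCondition = defaultTrainJobFailureCondition
		} else {
			t.FailureCondition = defaultKubeflowJobFailureCondition
		}
	}
	if len(t.PrimaryPodLabels) == 0 {
		if isTrainJob {
			t.PrimaryPodLabels = defaultTrainJobPrimaryPodLabels
		} else {
			t.PrimaryPodLabels = defaultKubeflowJobPrimaryPodLabels
		}
	}
}

func main() {
	trainJob := &trialTemplate{}
	setDefaults(trainJob, "TrainJob")
	fmt.Println(trainJob.FailureCondition)
	fmt.Println(trainJob.PrimaryPodLabels)

	// Labels set by the user are kept as they are.
	custom := &trialTemplate{PrimaryPodLabels: map[string]string{"role": "chief"}}
	setDefaults(custom, "PyTorchJob")
	fmt.Println(custom.PrimaryPodLabels)
}
```

With these defaults, the primary pod of a TrainJob trial is selected by the replicated job name `node` and job completion index `0`, whereas the older job kinds select it through the `training.kubeflow.org/job-role: master` label.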
