Commit 3f57061

Update failure policy KEP with proposed RecreateJob behavior (#925)
* Update KEP with proposed RecreateJob behavior
* Move RecreateReplicatedJob back to future work
* Review comment fixes
1 parent 84698cd commit 3f57061

File tree

keps/262-ConfigurableFailurePolicy/README.md

1 file changed: 43 additions & 122 deletions
@@ -278,6 +278,44 @@ spec:
             python3 train.py
 ```
 
+#### Story 5: Recreating individual jobs on failure rather than failing JobSet (`RecreateJob`)
+
+If it is possible for individual worker Jobs within a ReplicatedJob to be restarted independently on failure,
+without requiring a full restart of their parent ReplicatedJob or the entire JobSet, the `RecreateJob`
+failure policy can be used.
+
+**Example Failure Policy configuration for this use case**:
+
+```yaml
+apiVersion: jobset.x-k8s.io/v1alpha2
+kind: JobSet
+metadata:
+  name: recreate-job-example
+spec:
+  # Failure Policy to restart individual jobs of the target ReplicatedJob (recoverable-workers) if they fail,
+  # without restarting the entire JobSet or other jobs in recoverable-workers.
+  failurePolicy:
+    rules:
+    - action: RecreateJob
+      targetReplicatedJobs:
+      - recoverable-workers
+    maxRestarts: 10
+  replicatedJobs:
+  - name: recoverable-workers
+    replicas: 2
+    template:
+      spec:
+        parallelism: 1
+        completions: 1
+        backoffLimit: 0
+        template:
+          spec:
+            restartPolicy: Never
+            containers:
+            - name: main
+              image: python:3.10
+              command: ["..."]
+```
 
 ### Notes/Constraints/Caveats (Optional)
 
@@ -322,13 +360,16 @@ const (
 
 	// Don't count the failure against maxRestarts.
 	RestartJobSetAndIgnoreMaxRestarts FailurePolicyAction = "RestartJobSetAndIgnoreMaxRestarts"
+
+	// Recreate the failed Job without restarting the entire JobSet.
+	RecreateJob FailurePolicyAction = "RecreateJob"
 )
 
 // FailurePolicyRule defines a FailurePolicyAction to be executed if a child job
 // fails due to a reason listed in OnJobFailureReasons.
 type FailurePolicyRule struct {
 	// The action to take if the rule is matched.
-	// +kubebuilder:validation:Enum:=FailJobSet;RestartJobSetAndIgnoreMaxRestarts;FailJob;RestartJob
+	// +kubebuilder:validation:Enum:=FailJobSet;RestartJobSetAndIgnoreMaxRestarts;FailJob;RecreateJob
 	Action FailurePolicyAction `json:"action"`
 	// The requirement on the job failure reasons. The requirement
 	// is satisfied if at least one reason matches the list.
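
To make the matching semantics of this API concrete: the KEP's examples imply rules are evaluated in order, with the first satisfied rule winning (a later catch-all rule fires only if earlier rules did not match). Below is a minimal sketch of such first-match evaluation; the helper names (`firstMatchingAction`, `contains`) and the simplified struct fields are hypothetical illustrations, not JobSet controller code.

```go
// Sketch of first-match evaluation of JobSet failure policy rules.
package main

import "fmt"

// FailurePolicyAction mirrors the KEP's enum, including the RecreateJob
// value added by this commit.
type FailurePolicyAction string

const (
	FailJobSet                        FailurePolicyAction = "FailJobSet"
	RestartJobSetAndIgnoreMaxRestarts FailurePolicyAction = "RestartJobSetAndIgnoreMaxRestarts"
	FailJob                           FailurePolicyAction = "FailJob"
	RecreateJob                       FailurePolicyAction = "RecreateJob"
)

// FailurePolicyRule mirrors the KEP's struct, with fields simplified to
// plain strings for this sketch.
type FailurePolicyRule struct {
	Action               FailurePolicyAction
	OnJobFailureReasons  []string
	TargetReplicatedJobs []string
}

// firstMatchingAction returns the action of the first rule whose requirements
// are satisfied by the failed Job. An empty requirement list matches anything;
// ok is false when no rule matches, in which case the controller would fall
// back to the default behavior of restarting the whole JobSet.
func firstMatchingAction(rules []FailurePolicyRule, replicatedJob, failureReason string) (action FailurePolicyAction, ok bool) {
	for _, r := range rules {
		if len(r.OnJobFailureReasons) > 0 && !contains(r.OnJobFailureReasons, failureReason) {
			continue
		}
		if len(r.TargetReplicatedJobs) > 0 && !contains(r.TargetReplicatedJobs, replicatedJob) {
			continue
		}
		return r.Action, true
	}
	return "", false
}

func contains(list []string, want string) bool {
	for _, s := range list {
		if s == want {
			return true
		}
	}
	return false
}

func main() {
	rules := []FailurePolicyRule{
		{Action: RecreateJob, TargetReplicatedJobs: []string{"recoverable-workers"}},
	}
	// A failed Job in "recoverable-workers" matches the RecreateJob rule;
	// a failure anywhere else falls through to the default JobSet restart.
	fmt.Println(firstMatchingAction(rules, "recoverable-workers", "BackoffLimitExceeded"))
	fmt.Println(firstMatchingAction(rules, "driver", "BackoffLimitExceeded"))
}
```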
@@ -483,126 +524,6 @@ Additional actions we want to support in the future include:
    jobs of the target replicated job, **without incrementing the restart attempt annotation**. The jobs will then be
    recreated via the normal reconciliation process.
 
-2) `RestartJob`: To restart a single child job without restarting the entire JobSet, the controller will delete that
-   particular child job, **without incrementing the restart attempt annotation**, and allow the normal reconciliation
-   process to recreate it.
-
-3) `FailJob`: To leave a particular child job in a failed state without restarting it or restarting the JobSet, the
+2) `FailJob`: To leave a particular child job in a failed state without restarting it or restarting the JobSet, the
    controller will simply do nothing, taking no action on this job.
 
-
-### Story 1: RestartReplicatedJob
-
-As a user, I have a JobSet with 2 replicated jobs: one which runs distributed training processes across a pool of GPU
-nodes, and one which runs the driver/coordinator on a CPU pool. If a child job of the GPU worker ReplicatedJob crashes, I just want to restart the GPU workers and not the driver, then resume training from the latest checkpoint. However, if
-the driver crashes, I want to restart the entire JobSet, then resume training from the latest checkpoint.
-
-**Example Failure Policy configuration for this use case**:
-
-```yaml
-apiVersion: jobset.x-k8s.io/v1alpha2
-kind: JobSet
-metadata:
-  name: restart-replicated-job-example
-  annotations:
-    alpha.jobset.sigs.k8s.io/exclusive-topology: {{topologyDomain}} # 1:1 job replica to topology domain assignment
-spec:
-  # Failure Policy to restart the child jobs of the target ReplicatedJob (gpu-workers) if any fail, but fall
-  # back to the default behavior of restarting the entire JobSet if the driver fails.
-  failurePolicy:
-    rules:
-    - action: RestartReplicatedJob
-      targetReplicatedJobs:
-      - gpu-workers
-    maxRestarts: 10
-  replicatedJobs:
-  - name: driver
-    replicas: 1
-    template:
-      spec:
-        parallelism: 1
-        completions: 1
-        backoffLimit: 0
-        template:
-          spec:
-            restartPolicy: Never
-            containers:
-            - name: main
-              image: python:3.10
-              command: ["..."]
-  - name: gpu-workers
-    replicas: 4 # number of node pools
-    template:
-      spec:
-        parallelism: 2
-        completions: 2
-        backoffLimit: 0
-        template:
-          spec:
-            containers:
-            - name: main
-              image: pytorch:latest
-              command: ["..."]
-              resources:
-                limits:
-                  nvidia.com/gpu: 1
-```
-
-### Story 2: FailJob and RestartJob
-
-Dependency: https://github.com/kubernetes/kubernetes/issues/122972
-
-As a user, I want to run an HPC simulation in which each child job runs a simulation with different random initial
-parameters. When a simulation ends, the application will exit with one of two exit codes:
-
-- Exit code 2, which indicates the simulation produced an invalid result due to bad starting parameters, and should
-  not be retried.
-- Exit code 3, which indicates the simulation produced an invalid result but the initial parameters were reasonable,
-  so the simulation should be restarted.
-
-When a Job fails due to a pod failing with exit code 2, I want the Job to stay in a failed state.
-When a Job fails due to a pod failing with exit code 3, I want to restart the Job.
-
-**Example Failure Policy configuration for this use case**:
-
-```yaml
-apiVersion: jobset.x-k8s.io/v1alpha2
-kind: JobSet
-metadata:
-  name: restart-replicated-job-example
-  annotations:
-    alpha.jobset.sigs.k8s.io/exclusive-topology: {{topologyDomain}} # 1:1 job replica to topology domain assignment
-spec:
-  failurePolicy:
-    rules:
-    # If a Job fails due to a pod failing with exit code 3, restart that Job.
-    - action: RestartJob
-      onJobFailureReasons:
-      - ExitCode3
-    # Catch-all rule to leave a failed job in the failed state, if it hasn't matched previous rules.
-    - action: FailJob
-  replicatedJobs:
-  - name: simulations
-    replicas: 10
-    template:
-      spec:
-        parallelism: 1
-        completions: 1
-        backoffLimit: 0
-        # If a pod fails with exit code 3, fail the Job, using the user-defined reason.
-        podFailurePolicy:
-          rules:
-          - action: FailJob
-            onExitCodes:
-              containerName: main
-              operator: In
-              values: [3]
-            setConditionReason: "ExitCode3"
-        template:
-          spec:
-            restartPolicy: Never
-            containers:
-            - name: main
-              image: python:3.10
-              command: ["..."]
-```
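
The future-work text above characterizes each action by which child Jobs the controller deletes and whether the restart-attempt annotation is incremented. A rough sketch of that dispatch under those stated semantics follows; every identifier here (`executeAction`, `deleteJob`, `jobsInReplicatedJob`, `allJobs`, `incrementRestartAttempt`) is a hypothetical stub, since the real controller acts through the Kubernetes API rather than in-memory helpers.

```go
// Sketch of the per-action behavior described in the KEP text above.
package main

import "fmt"

type action string

const (
	recreateJob           action = "RecreateJob"           // proposed in this commit
	recreateReplicatedJob action = "RecreateReplicatedJob" // future work per the commit message
	failJob               action = "FailJob"               // future work
)

// executeAction applies the behaviors described above: the recreate-style
// actions delete child Jobs and let normal reconciliation recreate them
// (RecreateReplicatedJob explicitly without incrementing the restart-attempt
// annotation); FailJob takes no action; anything else falls back to a full
// JobSet restart, which counts against maxRestarts.
func executeAction(a action, failedJob, replicatedJob string) {
	switch a {
	case recreateJob:
		deleteJob(failedJob) // only the failed child Job
	case recreateReplicatedJob:
		for _, j := range jobsInReplicatedJob(replicatedJob) {
			deleteJob(j) // all child Jobs of the target ReplicatedJob
		}
	case failJob:
		// Leave the failed Job in its failed state: do nothing.
	default:
		incrementRestartAttempt() // counted against maxRestarts
		for _, j := range allJobs() {
			deleteJob(j)
		}
	}
}

// Hypothetical stubs standing in for client-go / controller machinery.
func deleteJob(name string)                  { fmt.Println("deleting", name) }
func jobsInReplicatedJob(rj string) []string { return []string{rj + "-0", rj + "-1"} }
func allJobs() []string                      { return []string{"driver-0", "workers-0", "workers-1"} }
func incrementRestartAttempt()               { fmt.Println("restart attempt incremented") }

func main() {
	executeAction(recreateJob, "recoverable-workers-1", "recoverable-workers")
}
```

The key distinction the KEP draws is between deleting Jobs in place for reconciliation to recreate, and a full JobSet restart that increments the restart attempt.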
