// The requirement on the job failure reasons. The requirement
// is satisfied if at least one reason matches the list.
@@ -483,126 +524,6 @@ Additional actions we want to support in the future include:
 jobs of the target replicated job, **without incrementing the restart attempt annotation**. The jobs will then be
 recreated via the normal reconciliation process.
 
-2)`RestartJob`: To restart a single child job without restarting the entire JobSet, the controller will delete that
-particular child job, **without incrementing the restart attempt annotation**, and allow the normal reconciliation
-process to recreate it.
-
-3)`FailJob`: To leave a particular child job in a failed state without restarting it or restarting the JobSet, the
+2)`FailJob`: To leave a particular child job in a failed state without restarting it or restarting the JobSet, the
 controller will simply do nothing, taking no action on this job.
 
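Review note: the two actions that remain after this change can be sketched roughly as below. This is only an illustration of the semantics described in the text, not the controller's actual code. The names `ChildJob`, `Plan`, and `planFor` are hypothetical, and the sketch assumes that "restarting" a replicated job means deleting its child jobs and letting reconciliation recreate them.

```go
package main

import "fmt"

// Action names follow the failure policy actions described in the text.
type Action string

const (
	RestartReplicatedJob Action = "RestartReplicatedJob"
	FailJob              Action = "FailJob"
)

// ChildJob is a hypothetical, simplified view of a child job in a JobSet.
type ChildJob struct {
	Name          string
	ReplicatedJob string // name of the parent replicated job
}

// Plan is a hypothetical summary of what the controller would do.
type Plan struct {
	Delete            []string // child jobs to delete so reconciliation recreates them
	IncrementRestarts bool     // whether the restart attempt annotation is bumped
}

// planFor maps an action and a failed child job to a plan.
// Neither action increments the restart attempt annotation.
func planFor(action Action, failed ChildJob, all []ChildJob) Plan {
	plan := Plan{}
	switch action {
	case RestartReplicatedJob:
		// Delete every child job that belongs to the failed job's replicated job.
		for _, j := range all {
			if j.ReplicatedJob == failed.ReplicatedJob {
				plan.Delete = append(plan.Delete, j.Name)
			}
		}
	case FailJob:
		// Leave the failed job as it is: nothing to delete, nothing to restart.
	}
	return plan
}

func main() {
	jobs := []ChildJob{
		{Name: "workers-0", ReplicatedJob: "workers"},
		{Name: "workers-1", ReplicatedJob: "workers"},
		{Name: "driver-0", ReplicatedJob: "driver"},
	}
	fmt.Printf("%+v\n", planFor(RestartReplicatedJob, jobs[0], jobs))
	fmt.Printf("%+v\n", planFor(FailJob, jobs[0], jobs))
}
```

Returning a plan instead of deleting jobs directly keeps the decision separate from the Kubernetes API calls, which makes the "no restart count increment" property easy to check in isolation.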
-
-### Story 1: RestartReplicatedJob
-
-As a user, I have a JobSet with 2 replicated jobs: one which runs distributed training processes across a pool of GPU
-nodes, and one which runs the driver/coordinator on a CPU pool. If a child job of the GPU worker ReplicatedJob crashes, I just want to restart the GPU workers and not the driver, then resume training from the latest checkpoint. However, if
-the driver crashes, I want to restart the entire JobSet, then resume training from the latest checkpoint.
-
-**Example Failure Policy configuration for this use case**:
-
-```yaml
-apiVersion: jobset.x-k8s.io/v1alpha2
-kind: JobSet
-metadata:
-  name: restart-replicated-job-example
-  annotations:
-    alpha.jobset.sigs.k8s.io/exclusive-topology: {{topologyDomain}} # 1:1 job replica to topology domain assignment
-spec:
-  # Failure Policy to restart the child jobs of the target ReplicatedJob (gpu-workers) if any fail, but fall
-  # back to the default behavior of restarting the entire JobSet if the driver fails.