---
title: Support cluster failover for stateful workloads
authors:
- "@XiShanYongYe-Chang"
reviewers:
- "@mszacillo"
- "@Dyex719"
- "@RainbowMango"
approvers:
- "@kevin-wangzefeng"
- "@RainbowMango"
creation-date: 2025-06-26
---

This proposal is based on and modifies the design in [#5116](https://github.com/karmada-io/karmada/pull/5116).

## Summary

The advantage of stateless workloads is their higher fault tolerance, as the loss of a workload does not impact application correctness or require data recovery. In contrast, stateful workloads (such as Flink) depend on checkpointed or saved state to recover after failures. If that state is lost, the job may not be able to resume from where it left off.

For workloads, failover covers two scenarios: the workload itself fails and needs to be migrated to another cluster, or the cluster encounters a failure (e.g., node crashes, network partitions, or a control plane outage) that requires some workloads within the cluster to be relocated.

The Karmada community completed the design and development for the first scenario in v1.12: [#5788](https://github.com/karmada-io/karmada/issues/5788). This proposal builds upon that design experience to support the second scenario: stateful workload failover caused by cluster failures. The API design unifies the handling of aspects common to both scenarios to avoid redundant fields and improve the user experience.

## Motivation

Karmada currently supports propagating various types of resources, including native Kubernetes objects and CRDs, and this includes stateful workloads, enabling multi-cluster resilience and elastic scaling for distributed applications. However, Karmada's processing logic is based on the assumption that the propagated resources are stateless. In failover scenarios, users may want to preserve a certain state before failure so that workloads can resume from the point where they stopped in the previous cluster.

For CRDs that define data-processing workloads (such as Flink or Spark), restarting the application from a previous checkpoint is particularly useful. It allows the application to seamlessly resume processing from the point of interruption while avoiding duplicate processing. Karmada v1.12 supports stateful workload migration after application failures, and the next step is to extend this support to stateful workload migration following cluster failures.

### Goals

Support stateful application cluster failover, allowing users to customize, within Karmada, the state information that needs to be preserved for workloads in the failed cluster. This information is then passed to the migrated workloads, enabling the newly started workloads to recover from the state where the previous cluster stopped.

### Non-Goals

- No optimization will be made for cluster failover functionality.
- No optimization will be made for application failover functionality.

## Proposal

### Flink use case

This section provides a basic understanding of how Flink applications work.

In Flink applications, checkpoints are snapshots of the application's state at a specific point in time. They contain information about all records processed up to that moment. This information is continuously persisted to durable storage at regular intervals. To retrieve the latest Flink application state from persistent storage, the `jobId` of the job being recovered and the path to the persisted state are required. Users can then use this information to modify the Flink CRD accordingly to restore from the latest state.

```yaml
jobStatus:
  checkpointInfo:
    formatType: FULL
    lastCheckpoint:
      formatType: FULL
      timeStamp: 1734462491693
      triggerType: PERIODIC
    lastPeriodicCheckpointTimestamp: 1734462491693
    triggerId: 447b790bb9c88da9ae753f00f45acb0e
    triggerTimestamp: 1734462506836
    triggerType: PERIODIC
  jobId: e6fdb5c0997c11b0c62d796b3df25e86
```

The recovery process for Flink workloads can be summarized as follows (a sketch of the resulting resource follows the list):

1. Karmada obtains the `{ .jobStatus.jobId }` from the FlinkDeployment in the failed cluster.
2. Karmada synchronizes the `jobId` to the migrated FlinkDeployment using a label in the form `<user-defined key>: <jobId>`.
3. The user retrieves checkpoint data using the `jobId` and generates the initialSavepointPath: `/<shared-path>/<job-namespace>/<jobID>/checkpoints/<checkpoint>`.
4. The user injects the initialSavepointPath information into the migrated FlinkDeployment.
5. The Flink Operator restarts Flink from the savepoint.

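For illustration, the migrated FlinkDeployment after steps 2–4 might look roughly like the sketch below. The label key follows the user-defined key from step 2; the key and savepoint path shown here are illustrative, not mandated by this proposal:

```yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: foo
  labels:
    # Injected by Karmada in step 2; the key is whatever the user configured.
    cluster.karmada.io/failover-jobid: e6fdb5c0997c11b0c62d796b3df25e86
spec:
  job:
    # Filled in by the user in step 4, derived from the jobId and the shared checkpoint storage.
    initialSavepointPath: /flink-shared/default/e6fdb5c0997c11b0c62d796b3df25e86/checkpoints/chk-42
    upgradeMode: savepoint
```
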
### User Stories

#### Story 1

As an administrator or operator, I can define the state preservation strategy for Flink applications deployed on Karmada after a cluster failure. This includes preserving the `jobID` of the Flink application, allowing the application to resume processing from the point of failure using the `jobID` after migration.

## Design Details

This proposal provides users with a method to persist the necessary state of workloads during cluster failover migration. By configuring propagation policies, users can leverage the state information preserved by the Karmada system to restore workloads once they are migrated due to cluster failure. Since stateful workloads may have different ways to recover from the latest state, configurability of this feature is crucial for users.

To support this feature, we need at least:

- A mechanism for stateful workloads to preserve state during cluster failover migration and to reload the preserved state upon workload recovery;
- The ability for users to customize which state information to preserve during the state preservation phase;
- The ability for users to customize how the preserved state information is injected into the migrated workload during the state recovery phase.

> Note: An important detail is that if all replicas of a stateful workload are not migrated together, the timing for state recovery becomes uncertain for the stateful workload. Therefore, this proposal focuses on scenarios where all replicas of a stateful workload are migrated together.

### State Preservation

Key design points for state preservation are as follows:

1. The state information to be preserved is parsed from the workload's `.status` field; state information not present in `.status` cannot be preserved.
2. Users can customize one or more pieces of state information to be preserved for the workload.
3. The state preservation method can be unified for workload migration caused by both cluster failure and application failure.
4. The preserved state information is stored as key-value pairs in the `gracefulEvictionTask` of the `ResourceBinding` resource related to the workload.

The `PropagationPolicy` API is modified as follows:

```go
// FailoverBehavior indicates failover behaviors in case of an application or
// cluster failure.
type FailoverBehavior struct {
	// Application indicates failover behaviors in case of application failure.
	// If this value is nil, failover is disabled.
	// If set, the PropagateDeps should be true so that the dependencies could
	// be migrated along with the application.
	// +optional
	Application *ApplicationFailoverBehavior `json:"application,omitempty"`

	// Cluster indicates failover behaviors in case of cluster failure.
	// +optional
	Cluster *ClusterFailoverBehavior `json:"cluster,omitempty"`
}

// ClusterFailoverBehavior indicates cluster failover behaviors.
type ClusterFailoverBehavior struct {
	// PurgeMode represents how to deal with the legacy applications on the
	// cluster from which the application is migrated.
	// Valid options are "Directly", "Gracefully".
	// Defaults to "Gracefully".
	// +kubebuilder:validation:Enum=Directly;Gracefully
	// +kubebuilder:default=Gracefully
	// +optional
	PurgeMode PurgeMode `json:"purgeMode,omitempty"`

	// StatePreservation defines the policy for preserving and restoring state data
	// during failover events for stateful applications.
	//
	// When an application fails over from one cluster to another, this policy enables
	// the extraction of critical data from the original resource configuration.
	// Upon successful migration, the extracted data is then re-injected into the new
	// resource, ensuring that the application can resume operation with its previous
	// state intact.
	// This is particularly useful for stateful applications where maintaining data
	// consistency across failover events is crucial.
	// If not specified, means no state data will be preserved.
	//
	// Note: This requires the StatefulFailoverInjection feature gate to be enabled,
	// which is alpha.
	// +optional
	StatePreservation *StatePreservation `json:"statePreservation,omitempty"`
}

// StatePreservation defines the policy for preserving state during failover events.
type StatePreservation struct {
	// Rules contains a list of StatePreservationRule configurations.
	// Each rule specifies a JSONPath expression targeting specific pieces of
	// state data to be preserved during failover events. An AliasLabelName is associated
	// with each rule, serving as a label key when the preserved data is passed
	// to the new cluster.
	// +required
	Rules []StatePreservationRule `json:"rules"`
}

// StatePreservationRule defines a single rule for state preservation.
// It includes a JSONPath expression and an alias name that will be used
// as a label key when passing state information to the new cluster.
type StatePreservationRule struct {
	// AliasLabelName is the name that will be used as a label key when the preserved
	// data is passed to the new cluster. This facilitates the injection of the
	// preserved state back into the application resources during recovery.
	// +required
	AliasLabelName string `json:"aliasLabelName"`

	// JSONPath is the JSONPath template used to identify the state data
	// to be preserved from the original resource configuration.
	// The JSONPath syntax follows the Kubernetes specification:
	// https://kubernetes.io/docs/reference/kubectl/jsonpath/
	//
	// Note: The JSONPath expression will start searching from the "status" field of
	// the API resource object by default. For example, to extract the "availableReplicas"
	// from a Deployment, the JSONPath expression should be "{.availableReplicas}", not
	// "{.status.availableReplicas}".
	//
	// +required
	JSONPath string `json:"jsonPath"`
}
```

In this proposal, we have uncommented the `Cluster` field in the `FailoverBehavior` struct to define behaviors related to cluster failover. We have also added the `StatePreservation` field to the `ClusterFailoverBehavior` struct, which specifies the policy for preserving and restoring state data of stateful applications during failover events. The type of this field is exactly the same as that of the `StatePreservation` field in the `ApplicationFailoverBehavior` struct.

In addition, the `PurgeMode` field has been added to the `ClusterFailoverBehavior` struct to specify how applications are removed from the failed cluster during migration. Supported values for `PurgeMode` are `Directly` and `Gracefully`, with the default set to `Gracefully`. However, if the user does not specify the `.spec.failover.cluster` field, the migration strategy is determined by the value of the `no-execute-taint-eviction-purge-mode` parameter of the `karmada-controller-manager` component.
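
As an illustration only (the exact control-plane Deployment layout is not prescribed here), the fallback purge mode could be configured on `karmada-controller-manager` via that parameter:

```yaml
# Illustrative snippet of the karmada-controller-manager container spec.
containers:
  - name: karmada-controller-manager
    command:
      - /bin/karmada-controller-manager
      - --no-execute-taint-eviction-purge-mode=Gracefully  # used when .spec.failover.cluster is not set
```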

The `ResourceBinding` API already meets the design goals and can directly reuse the `PreservedLabelState` and `ClustersBeforeFailover` fields in the `gracefulEvictionTask` without introducing new modifications.

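As a rough illustration (the task values below are hypothetical; the field names are those of the existing graceful eviction API), a preserved Flink jobID would be recorded on the `ResourceBinding` roughly as follows:

```yaml
apiVersion: work.karmada.io/v1alpha2
kind: ResourceBinding
metadata:
  name: foo-flinkdeployment
spec:
  gracefulEvictionTasks:
    - fromCluster: member1          # the failed cluster
      reason: TaintUntolerated      # illustrative reason value
      producer: taint-manager       # illustrative producer value
      clustersBeforeFailover:
        - member1
      preservedLabelState:
        cluster.karmada.io/failover-jobid: e6fdb5c0997c11b0c62d796b3df25e86
```
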
#### Example

The following example demonstrates how to configure the state preservation policy using a Flink application:

```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: foo
spec:
  resourceSelectors:
    - apiVersion: flink.apache.org/v1beta1
      kind: FlinkDeployment
      name: foo
  placement:
    clusterAffinity:
      clusterNames:
        - member1
        - member2
    spreadConstraints:
      - maxGroups: 1
        minGroups: 1
        spreadByField: cluster
  failover:
    cluster:
      purgeMode: Directly
      statePreservation:
        rules:
          - aliasLabelName: cluster.karmada.io/failover-jobid
            jsonPath: "{ .jobStatus.jobID }"
```

When a Flink application migration is triggered due to a cluster failure, the Karmada controller will, according to the above configuration, extract the jobID from the `.jobStatus.jobID` path in the Flink resource status of the failed cluster. Then, the controller will inject the jobID as a label into the Flink resource on the new cluster after migration, where the label key is `cluster.karmada.io/failover-jobid` and the value is the jobID.

#### Q&A

Q: Why do we use the same `StatePreservation` field in both `ClusterFailoverBehavior` and `ApplicationFailoverBehavior`?

A: Currently, defining `StatePreservation` in both places does seem somewhat redundant. However, we are also considering another point: we may not yet have sufficient scenarios to prove that the methods of obtaining information during application migration caused by cluster failures and application failures are exactly the same. [More question context](https://github.com/karmada-io/karmada/pull/6495#discussion_r2209788288).

### State Recovery

Key design points for state recovery are as follows:

1. State recovery is achieved by injecting the preserved state information as labels into the migrated new workload;
2. Labels are injected when the workload is migrated to a new cluster (to confirm that a cluster is the result of a new scheduling decision, the previous scheduling result needs to be stored in `gracefulEvictionTask`). During propagation of the workload, the controller injects the preserved state information into the manifest of the Work object (see the sketch after this list).

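For illustration only (the Work name and execution namespace below are assumed, not prescribed by this proposal), the preserved state surfaces in the Work manifest roughly like this after rescheduling to `member2`:

```yaml
apiVersion: work.karmada.io/v1alpha1
kind: Work
metadata:
  name: foo-flinkdeployment
  namespace: karmada-es-member2   # execution namespace of the new cluster
spec:
  workload:
    manifests:
      - apiVersion: flink.apache.org/v1beta1
        kind: FlinkDeployment
        metadata:
          name: foo
          labels:
            # Preserved state injected by the controller during propagation.
            cluster.karmada.io/failover-jobid: e6fdb5c0997c11b0c62d796b3df25e86
        spec: {}                  # original FlinkDeployment spec omitted here
```
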
> Note: The controller logic for state recovery can fully reuse the modifications made previously for application failover to support stateful workloads. No new changes are required in this proposal.

It should be emphasized that when the migration mode `PurgeMode` is set to `Directly`, the Karmada system first removes the application from the original cluster; only after the application has been completely removed can it be scheduled to a new cluster. The current implementation does not yet meet this requirement and needs to be adjusted. State recovery is then performed after the application has been scheduled to the new cluster.

### Feature Gate

In Karmada v1.12, the `StatefulFailoverInjection` FeatureGate was introduced to control whether state recovery is performed during the recovery process of stateful applications after application failure. In this proposal, we extend the control of this feature gate to cover stateful application recovery scenarios after cluster failure, thereby unifying the control of stateful application failover recovery.

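For example, the gate can be enabled on the relevant control-plane components through the `--feature-gates` flag; the snippet below is illustrative:

```yaml
# Illustrative: enabling the feature gate on karmada-controller-manager.
command:
  - /bin/karmada-controller-manager
  - --feature-gates=StatefulFailoverInjection=true
```
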
## System Failure Impact and Mitigation

None

## Graduation Criteria

### Alpha to Beta

- Decide whether to remove redundant fields from the `ApplicationFailoverBehavior` and `ClusterFailoverBehavior` structures;
- End-to-end (E2E) testing has been completed, ref [Test Planning](#test-planning);
- Feature Gate `StatefulFailoverInjection` is enabled by default;

### GA Graduation

- Scalability and performance testing;
- All known functional bugs have been fixed;
- Deprecate the Feature Gate `StatefulFailoverInjection`;

## Notes/Constraints/Caveats

As mentioned above, this proposal focuses on scenarios where all replicas of a stateful application are migrated together. If users split the replicas of a stateful application and propagate them across clusters, and the system migrates only some replicas, the stateful application failover recovery feature may not work effectively.

## Risks and Mitigations

1. Cluster failure detection is not yet very accurate, and the detection methods are relatively simple.
2. `Directly` purgeMode might be blocked if the connection between Karmada and the target cluster is broken.

## Impact Analysis on Related Features

No related features are affected at this time.

## Upgrade Impact Analysis

None

## Test Planning

UT: Add unit tests for the newly added functions.
E2E: Add end-to-end test cases for stateful workload recovery after cluster failover.

### E2E Use Case Planning

**Case 1:** A `Deployment` resource is propagated to member clusters via a `PropagationPolicy`. The `PropagationPolicy` resource is configured as follows:

```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: deploy-propagation
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: deploy
  placement:
    clusterAffinity:
      clusterNames:
        - member1
        - member2
    spreadConstraints:
      - spreadByField: cluster
        maxGroups: 1
        minGroups: 1
  failover:
    cluster:
      purgeMode: Directly
      statePreservation:
        rules:
          - aliasLabelName: failover.karmada.io/replicas
            jsonPath: "{ .replicas }"
          - aliasLabelName: failover.karmada.io/readyReplicas
            jsonPath: "{ .readyReplicas }"
```

The `Deployment` resource will be propagated to one of the clusters. Then, a `NoExecute` taint will be added to that cluster to trigger the migration of the application. Check whether the target label exists on the migrated `Deployment` resource.

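One illustrative way for the E2E case to add the taint is to patch the member cluster's `Cluster` object on the Karmada control plane (the taint key and value below are arbitrary test values):

```yaml
apiVersion: cluster.karmada.io/v1alpha1
kind: Cluster
metadata:
  name: member1
spec:
  taints:
    - key: e2e.karmada.io/cluster-failure   # arbitrary test key
      value: "true"
      effect: NoExecute
```
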
**Case 2:** A `StatefulSet` resource is propagated to member clusters via a `PropagationPolicy`. The `PropagationPolicy` resource is configured as follows:

```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: sts-propagation
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: StatefulSet
      name: sts
  placement:
    clusterAffinity:
      clusterNames:
        - member1
        - member2
    spreadConstraints:
      - spreadByField: cluster
        maxGroups: 1
        minGroups: 1
  failover:
    cluster:
      purgeMode: Directly
      statePreservation:
        rules:
          - aliasLabelName: failover.karmada.io/replicas
            jsonPath: "{ .replicas }"
          - aliasLabelName: failover.karmada.io/readyReplicas
            jsonPath: "{ .readyReplicas }"
```

The `StatefulSet` resource will be propagated to one of the clusters. Then, a `NoExecute` taint will be added to that cluster to trigger the migration of the application. Check whether the target label exists on the migrated `StatefulSet` resource.

**Case 3:** A `FlinkDeployment` resource is propagated to member clusters via a `PropagationPolicy`. The `PropagationPolicy` resource is configured as follows:

```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: foo
spec:
  resourceSelectors:
    - apiVersion: flink.apache.org/v1beta1
      kind: FlinkDeployment
      name: foo
  placement:
    clusterAffinity:
      clusterNames:
        - member1
        - member2
    spreadConstraints:
      - maxGroups: 1
        minGroups: 1
        spreadByField: cluster
  failover:
    cluster:
      purgeMode: Directly
      statePreservation:
        rules:
          - aliasLabelName: failover.karmada.io/failover-jobid
            jsonPath: "{ .jobStatus.jobID }"
```

The `FlinkDeployment` resource will be propagated to one of the clusters. Then, a `NoExecute` taint will be added to that cluster to trigger the migration of the application. Check whether the target label exists on the migrated `FlinkDeployment` resource.
