
[Proposal] Support cluster failover for stateful workloads #6495


Open
XiShanYongYe-Chang wants to merge 1 commit into master from support-stateful-cluster-failover

Conversation

XiShanYongYe-Chang
Member

What type of PR is this?

/kind feature
/kind documentation

What this PR does / why we need it:

Support cluster failover for stateful workloads.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

@karmada-bot added the kind/feature and kind/documentation labels on Jun 26, 2025
@karmada-bot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from xishanyongye-chang. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karmada-bot added the size/L label on Jun 26, 2025
@XiShanYongYe-Chang
Member Author

/cc @mszacillo

@karmada-bot requested a review from mszacillo on June 26, 2025 13:42
@codecov-commenter

codecov-commenter commented Jun 26, 2025

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 45.34%. Comparing base (b8f6874) to head (f80c613).
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6495      +/-   ##
==========================================
- Coverage   45.34%   45.34%   -0.01%     
==========================================
  Files         687      687              
  Lines       56393    56393              
==========================================
- Hits        25571    25569       -2     
- Misses      29226    29228       +2     
  Partials     1596     1596              
Flag | Coverage Δ
unittests | 45.34% <ø> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.


@XiShanYongYe-Chang
Member Author

Hi @mszacillo, any comments?

@XiShanYongYe-Chang force-pushed the support-stateful-cluster-failover branch from 63e48b0 to 9510e81 on July 15, 2025
@RainbowMango
Member

/gemini-review


Hello @RainbowMango,

Thank you for initiating the review with /gemini-review. I've thoroughly reviewed the proposal document docs/proposals/failover/stateful-workload-cluster-failover.md which outlines the design for supporting cluster failover for stateful workloads.

Overall, the proposal is well-structured, clearly articulates the problem, and presents a logical extension of existing functionalities within Karmada. The API modifications are minimal and seem to integrate well with the current PropagationPolicy and ResourceBinding structures. The Flink use case provides a helpful concrete example.

Here are a few comments and suggestions based on my review:

General Observations & Questions

  • User Intervention in Flink Use Case (Steps 3 & 4 of Flink use case):
    The proposal states that after Karmada provides the jobId (Step 2), the user is responsible for retrieving checkpoint data and injecting the initialSavepointPath into the migrated FlinkDeployment. While this design effectively passes the necessary state information, it introduces a manual step for the user. (A rough sketch of what that manual step could look like is included after this list.)

    • Question: Is there a long-term vision or a future proposal planned to automate this initialSavepointPath generation and injection process, perhaps through a more integrated mechanism or a custom controller, to further enhance the user experience for stateful workload failover?
  • Completeness of "Risks and Mitigations":
    The "Risks and Mitigations" section currently states "None". However, the "System Failure Impact and Mitigation" section correctly identifies a limitation: "If users split the replicas of a stateful application and propagate them across different clusters, and the system migrates only some replicas, the stateful application failover recovery feature may not work effectively."

    • Suggestion: For clarity and consistency, it would be beneficial to explicitly list this known limitation under the "Risks and Mitigations" section as well. This ensures all potential challenges are consolidated in one place.
  • Performance and Scalability Considerations for Stateful Workloads:
    The "Graduation Criteria" mentions "Scalability and performance testing" for GA. While this proposal focuses on the core mechanism, stateful workloads often involve significant data transfer or re-initialization.

    • Suggestion: Consider adding a brief note in the "Design Details" or "System Failure Impact and Mitigation" section about potential performance implications or challenges related to large state sizes or network latency during failover for stateful applications. Even if addressing these is a "No-Goal" for this specific PR, acknowledging them demonstrates foresight and can guide future work.
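To make the manual step discussed above more concrete, here is a rough, hypothetical sketch (not part of the proposal) of a user-side script that reads the label Karmada preserved on the migrated FlinkDeployment and patches spec.job.initialSavepointPath. The label key ("jobId"), the checkpoint path layout, the namespace, and the resource name are all assumptions made for illustration.

```go
// Hypothetical user-side step after a cluster failover: read the preserved label
// and turn it into an initial savepoint path on the migrated FlinkDeployment.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := dynamic.NewForConfigOrDie(cfg)

	// GVR of the Flink Kubernetes Operator's FlinkDeployment CRD.
	gvr := schema.GroupVersionResource{Group: "flink.apache.org", Version: "v1beta1", Resource: "flinkdeployments"}

	// Fetch the migrated FlinkDeployment (namespace and name are assumptions).
	fd, err := client.Resource(gvr).Namespace("default").Get(context.TODO(), "my-flink-job", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// "jobId" is the alias label name assumed to be configured in the StatePreservation rules.
	jobID := fd.GetLabels()["jobId"]

	// Assumed checkpoint layout; in practice the user looks up the latest
	// checkpoint for this job ID in durable storage (e.g. S3/HDFS).
	savepoint := fmt.Sprintf("s3://flink-checkpoints/%s/chk-latest", jobID)

	// Patch spec.job.initialSavepointPath so the job resumes from the preserved state.
	patch := []byte(fmt.Sprintf(`{"spec":{"job":{"initialSavepointPath":%q}}}`, savepoint))
	if _, err := client.Resource(gvr).Namespace("default").Patch(
		context.TODO(), "my-flink-job", types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
		panic(err)
	}
}
```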

API Design

  • The proposed API changes (FailoverBehavior.Cluster, ClusterFailoverBehavior.StatePreservation, StatePreservation, StatePreservationRule) are clear and seem appropriate for the intended functionality. Reusing PreservedLabelState and ClustersBeforeFailover in ResourceBinding is a good design choice that minimizes API surface area.

Testing

  • The test plan is clear: unit tests for new functions and E2E tests for stateful workload recovery after cluster failover. This is essential for validating the new functionality.

Overall, this is a solid proposal that addresses a critical need for stateful workloads in multi-cluster environments. Once the above points are considered, I believe this PR will be in excellent shape.

Let me know if you have any questions or would like to discuss these points further.


## Summary

The advantage of stateless workloads is their higher fault tolerance, as the loss of workloads does not affect user sessions. However, for stateful workloads (such as Flink), the loss of workloads may result in the loss of session data, preventing the workload from recovering correctly.
Member

What's the session data thing here?

Member Author
@XiShanYongYe-Chang Jul 16, 2025

I lifted it from #5116.

It may be:

  • User Session Information: This includes data such as the user's login status, session identifiers, session start time, and other session-related metadata.
  • Temporary Data: Data that the application generates while processing user requests, which needs to be consistent across multiple requests.
  • State Information: The state that the application maintains during its operation, such as intermediate computation results, checkpoints, and other runtime data.

Member Author
@XiShanYongYe-Chang Jul 16, 2025

Would it be easier to understand if we used "persistent data" instead?

Contributor

How about:

The advantage of stateless workloads is their higher fault tolerance, as the loss of workloads does not impact application correctness or require data recovery. In contrast, stateful workloads (like Flink) depend on checkpointed or saved state to recover after failures. If that state is lost, the job may not be able to resume from where it left off.

Member Author

Thanks @mszacillo


The advantage of stateless workloads is their higher fault tolerance, as the loss of workloads does not affect user sessions. However, for stateful workloads (such as Flink), the loss of workloads may result in the loss of session data, preventing the workload from recovering correctly.

For workloads, failover scenarios include two cases: the first is when the workload itself fails and needs to be migrated to another cluster; the second is when the cluster encounters a failure, such as a network partition, requiring some workloads within the cluster to be migrated.
Member

Does "network partition" mean that the network of a specific cluster is blocked?
I doubt that is a typical scenario that should trigger cluster failover.

Member Author

"Network partition" is an established term; we can refer to the explanation provided by Wikipedia.

Let me provide a few more typical scenarios:

e.g., node crashes, network partitions, or control plane outages


Karmada currently supports propagating various types of resources, including Kubernetes objects and CRDs, which is particularly useful for ensuring the resilience of stateful workloads in multi-cluster environments. However, the Karmada system's processing logic is based on the assumption that the propagated resources are stateless. In failover scenarios, users may want to preserve a certain state before failure so that workloads can resume from the point where they stopped in the previous cluster.

For CRDs that handle data processing (such as Flink or Spark), restarting the application from a previous checkpoint can be particularly useful. This allows the application to seamlessly resume processing from the point of interruption while avoiding duplicate processing. Karmada v1.12 supports stateful workload migration after application failures, and the next step is to extend support for stateful workload migration following cluster failures.
Member

CRD can't handle data processing...


## Motivation

Karmada currently supports propagating various types of resources, including Kubernetes objects and CRDs, which is particularly useful for ensuring the resilience of stateful workloads in multi-cluster environments. However, the Karmada system's processing logic is based on the assumption that the propagated resources are stateless. In failover scenarios, users may want to preserve a certain state before failure so that workloads can resume from the point where they stopped in the previous cluster.
Member

Karmada currently supports propagating various types of resources, including Kubernetes objects and CRDs, which is particularly useful for ensuring the resilience of stateful workloads in multi-cluster environments.

I don't get why supporting various resources is useful for stateful workloads.

Member Author

I lifted it from #5116.

Let me revise the wording.

Comment on lines 225 to 229
What situations may cause the feature to fail, and will it cause system crashes?

As mentioned above, this proposal focuses on scenarios where all replicas of a stateful application are migrated together. If users split the replicas of a stateful application and propagate them across different clusters, and the system migrates only some replicas, the stateful application failover recovery feature may not work effectively.

This feature will not cause system crashes.
Member

We won't accept any feature that would cause the system to crash; that's common sense, so it's not necessary to explain.

As mentioned above, this proposal focuses on scenarios where all replicas of a stateful application are migrated together. If users split the replicas of a stateful application and propagate them across different clusters, and the system migrates only some replicas, the stateful application failover recovery feature may not work effectively.

  1. This is a limitation or constraint of this feature; I would prefer to put it under Notes/Constraints/Caveats (Optional).
  2. Another question is: can the system identify whether a stateful workload can be split across clusters, considering the use case of propagating etcd replicas to multiple clusters?

Member Author

This is a limitation or constraint of this feature; I would prefer to put it under Notes/Constraints/Caveats (Optional).

+1

Another question is: can the system identify whether a stateful workload can be split across clusters, considering the use case of propagating etcd replicas to multiple clusters?

Sorry, I didn't quite understand the question.

Comment on lines 235 to 236
- The `StatePreservation` definition in the `PropagationPolicy` API will transition to Beta;
- The `PreservedLabelState` and `ClustersBeforeFailover` fields in the `ResourceBinding` API will transition to Beta;
Member

When speaking of graduation criteria, we are talking about the maturity of a feature, not an API field...


- The `StatePreservation` definition in the `PropagationPolicy` API will transition to Beta;
- The `PreservedLabelState` and `ClustersBeforeFailover` fields in the `ResourceBinding` API will transition to Beta;
- End-to-end (E2E) testing has been completed.
Member

How many E2E tests are planned for this feature? I mean, how do we know that the E2E testing is complete?

Member Author

Supplemented in the Test Planning section.

Comment on lines 172 to 216
In this proposal, we have uncommented the `Cluster` field in the `FailoverBehavior` struct to define behaviors related to cluster failover. We have also added the `StatePreservation` field to the `ClusterFailoverBehavior` struct, which is used to specify the policy for preserving and restoring state data of stateful applications during failover events. The type of this field is exactly the same as the `StatePreservation` field in the `ApplicationFailoverBehavior` struct.

The `ResourceBinding` API already meets the design goals and can directly reuse the `PreservedLabelState` and `ClustersBeforeFailover` fields in the `gracefulEvictionTask` without introducing new modifications.
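For readers who want a concrete picture, below is a rough sketch of the types described above. It is modeled on the existing ApplicationFailoverBehavior and StatePreservation definitions in the policy/v1alpha1 package; the comments, the ApplicationFailoverBehavior stub, and any details not named in the proposal text are illustrative assumptions rather than the final API.

```go
// Package v1alpha1 -- illustrative sketch only, not the authoritative Karmada API.
package v1alpha1

// FailoverBehavior indicates failover behaviors in case of an application or cluster failure.
type FailoverBehavior struct {
	// Application indicates failover behaviors in case of application failure.
	// +optional
	Application *ApplicationFailoverBehavior `json:"application,omitempty"`

	// Cluster indicates failover behaviors in case of cluster failure.
	// This proposal enables this previously commented-out field.
	// +optional
	Cluster *ClusterFailoverBehavior `json:"cluster,omitempty"`
}

// ApplicationFailoverBehavior indicates application failover behaviors.
// Other existing fields are omitted from this sketch.
type ApplicationFailoverBehavior struct {
	// StatePreservation defines the policy for preserving and restoring state data
	// during failover events for stateful applications.
	// +optional
	StatePreservation *StatePreservation `json:"statePreservation,omitempty"`
}

// ClusterFailoverBehavior indicates cluster failover behaviors.
type ClusterFailoverBehavior struct {
	// StatePreservation reuses exactly the same type as
	// ApplicationFailoverBehavior.StatePreservation.
	// Additional fields (e.g. a PurgeMode) are still being discussed in this review.
	// +optional
	StatePreservation *StatePreservation `json:"statePreservation,omitempty"`
}

// StatePreservation defines the policy for preserving and restoring state data
// during failover events for stateful applications.
type StatePreservation struct {
	// Rules contains a list of StatePreservationRule configurations.
	// +required
	Rules []StatePreservationRule `json:"rules"`
}

// StatePreservationRule defines a single state preservation rule.
type StatePreservationRule struct {
	// AliasLabelName is the label key under which the preserved value is re-injected
	// into the resource after migration (e.g. carrying a Flink jobId).
	// +required
	AliasLabelName string `json:"aliasLabelName"`

	// JSONPath is the JSONPath expression, evaluated against the resource's status,
	// that selects the piece of state to preserve.
	// +required
	JSONPath string `json:"jsonPath"`
}
```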
Member

I don't see where the PurgeMode thing is explained.


// Cluster indicates failover behaviors in case of cluster failure.
// +optional
Cluster *ClusterFailoverBehavior `json:"cluster,omitempty"`
Contributor

Apologies for the delay here, I have been traveling recently.

I don’t have strong objections to introducing ClusterFailoverBehavior as a separate field. That said, from my perspective, do we expect users to define different state preservation between cluster and application failover?

Right now, we're running on an internal commit that simply reuses the same state preservation defined under FailoverBehavior during cluster failover events, and that’s worked well so far.

I think defining a separate ClusterFailoverBehavior makes more sense once the cluster failover logic is more fully developed — for example, if users can define custom cluster failover criteria. At the moment, including state preservation options in both places feels somewhat redundant. But I also understand if this is just a first step towards having a more formally defined ClusterFailover feature, so maybe we can address this later.

Member Author

Thanks @mszacillo

Apologies for the delay here, I have been traveling recently.

It sounds wonderful. 👍

At the moment, including state preservation options in both places feels somewhat redundant. But I also understand if this is just a first step towards having a more formally defined ClusterFailover feature, so maybe we can address this later.

@RainbowMango and I have also discussed this issue. Currently, defining StatePreservation in both places does seem somewhat redundant. However, we are also considering another point: we may not yet have sufficient scenarios to prove that the way state information is obtained during migrations triggered by cluster failures is exactly the same as for migrations triggered by application failures.

Therefore, this redundancy might be an intermediate state that will be eliminated in the future; it could also be validated by new scenarios and become the final state.

I would also like to understand whether this redundant design would be easy to adopt in the solution you are currently running.

Member

However, we are also considering another point: we may not yet have sufficient scenarios to prove that the way state information is obtained during migrations triggered by cluster failures is exactly the same as for migrations triggered by application failures.

Yes, that's exactly the reason. We don't know whether the requirements are exactly the same between application failover and cluster failover. Maybe they require different labels to identify the reason.

@XiShanYongYe-Chang This is a good question, and I think we should explain it in this proposal.

@XiShanYongYe-Chang force-pushed the support-stateful-cluster-failover branch from 9510e81 to f6a6f41 on July 17, 2025
@XiShanYongYe-Chang
Member Author

Hi @RainbowMango @mszacillo, an updated version has been pushed based on the comments. Can you help review it again?

@XiShanYongYe-Chang force-pushed the support-stateful-cluster-failover branch 2 times, most recently from ffff931 to 46a19b5 on July 17, 2025
@XiShanYongYe-Chang
Member Author

Hi @kevin-wangzefeng, can you help take a look?
/cc @kevin-wangzefeng

@XiShanYongYe-Chang force-pushed the support-stateful-cluster-failover branch from 46a19b5 to f80c613 on August 2, 2025
Labels
kind/documentation: Categorizes issue or PR as related to documentation.
kind/feature: Categorizes issue or PR as related to a new feature.
size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.
5 participants