
[Proposal] Support cluster failover for stateful workloads #6495


Open
XiShanYongYe-Chang wants to merge 1 commit into master from support-stateful-cluster-failover

Conversation

XiShanYongYe-Chang
Member

What type of PR is this?

/kind feature
/kind documentation

What this PR does / why we need it:

Support cluster failover for stateful workloads.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

@karmada-bot added the kind/feature and kind/documentation labels on Jun 26, 2025
@karmada-bot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from xishanyongye-chang. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karmada-bot added the size/L label on Jun 26, 2025
@XiShanYongYe-Chang
Member Author

/cc @mszacillo

@karmada-bot requested a review from mszacillo on June 26, 2025 13:42
@codecov-commenter

codecov-commenter commented Jun 26, 2025

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 45.34%. Comparing base (b8f6874) to head (f80c613).
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6495      +/-   ##
==========================================
- Coverage   45.34%   45.34%   -0.01%     
==========================================
  Files         687      687              
  Lines       56393    56393              
==========================================
- Hits        25571    25569       -2     
- Misses      29226    29228       +2     
  Partials     1596     1596              
Flag | Coverage Δ
unittests | 45.34% <ø> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.


@XiShanYongYe-Chang
Member Author

Hi @mszacillo, any comments?

@XiShanYongYe-Chang force-pushed the support-stateful-cluster-failover branch from 63e48b0 to 9510e81 on July 15, 2025
@RainbowMango
Member

/gemini-review


Hello @RainbowMango,

Thank you for initiating the review with /gemini-review. I've thoroughly reviewed the proposal document docs/proposals/failover/stateful-workload-cluster-failover.md which outlines the design for supporting cluster failover for stateful workloads.

Overall, the proposal is well-structured, clearly articulates the problem, and presents a logical extension of existing functionalities within Karmada. The API modifications are minimal and seem to integrate well with the current PropagationPolicy and ResourceBinding structures. The Flink use case provides a helpful concrete example.

Here are a few comments and suggestions based on my review:

General Observations & Questions

  • User Intervention in Flink Use Case (Steps 3 & 4 of Flink use case):
    The proposal states that after Karmada provides the jobId (Step 2), the user is responsible for retrieving checkpoint data and injecting the initialSavepointPath into the migrated FlinkDeployment. While this design effectively passes the necessary state information, it introduces a manual step for the user. (A rough sketch of what that manual step could look like is included after this list.)

    • Question: Is there a long-term vision or a future proposal planned to automate this initialSavepointPath generation and injection process, perhaps through a more integrated mechanism or a custom controller, to further enhance the user experience for stateful workload failover?
  • Completeness of "Risks and Mitigations":
    The "Risks and Mitigations" section currently states "None". However, the "System Failure Impact and Mitigation" section correctly identifies a limitation: "If users split the replicas of a stateful application and propagate them across different clusters, and the system migrates only some replicas, the stateful application failover recovery feature may not work effectively."

    • Suggestion: For clarity and consistency, it would be beneficial to explicitly list this known limitation under the "Risks and Mitigations" section as well. This ensures all potential challenges are consolidated in one place.
  • Performance and Scalability Considerations for Stateful Workloads:
    The "Graduation Criteria" mentions "Scalability and performance testing" for GA. While this proposal focuses on the core mechanism, stateful workloads often involve significant data transfer or re-initialization.

    • Suggestion: Consider adding a brief note in the "Design Details" or "System Failure Impact and Mitigation" section about potential performance implications or challenges related to large state sizes or network latency during failover for stateful applications. Even if addressing these is a "No-Goal" for this specific PR, acknowledging them demonstrates foresight and can guide future work.
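To make the manual step discussed above more concrete, here is a rough, hypothetical sketch (not part of the proposal) of a user-side script that reads the label Karmada preserved on the migrated FlinkDeployment and patches spec.job.initialSavepointPath. The label key ("jobId"), the checkpoint path layout, the namespace, and the resource name are all assumptions made for illustration.

```go
// Hypothetical user-side step after a cluster failover: read the preserved label
// and turn it into an initial savepoint path on the migrated FlinkDeployment.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := dynamic.NewForConfigOrDie(cfg)

	// GVR of the Flink Kubernetes Operator's FlinkDeployment CRD.
	gvr := schema.GroupVersionResource{Group: "flink.apache.org", Version: "v1beta1", Resource: "flinkdeployments"}

	// Fetch the migrated FlinkDeployment (namespace and name are assumptions).
	fd, err := client.Resource(gvr).Namespace("default").Get(context.TODO(), "my-flink-job", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// "jobId" is the alias label name assumed to be configured in the StatePreservation rules.
	jobID := fd.GetLabels()["jobId"]

	// Assumed checkpoint layout; in practice the user looks up the latest
	// checkpoint for this job ID in durable storage (e.g. S3/HDFS).
	savepoint := fmt.Sprintf("s3://flink-checkpoints/%s/chk-latest", jobID)

	// Patch spec.job.initialSavepointPath so the job resumes from the preserved state.
	patch := []byte(fmt.Sprintf(`{"spec":{"job":{"initialSavepointPath":%q}}}`, savepoint))
	if _, err := client.Resource(gvr).Namespace("default").Patch(
		context.TODO(), "my-flink-job", types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
		panic(err)
	}
}
```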

API Design

  • The proposed API changes (FailoverBehavior.Cluster, ClusterFailoverBehavior.StatePreservation, StatePreservation, StatePreservationRule) are clear and seem appropriate for the intended functionality. Reusing PreservedLabelState and ClustersBeforeFailover in ResourceBinding is a good design choice that minimizes API surface area.

Testing

  • The test plan is clear: unit tests for new functions and E2E tests for stateful workload recovery after cluster failover. This is essential for validating the new functionality.

Overall, this is a solid proposal that addresses a critical need for stateful workloads in multi-cluster environments. Once the above points are considered, I believe this PR will be in excellent shape.

Let me know if you have any questions or would like to discuss these points further.


## Summary

The advantage of stateless workloads is their higher fault tolerance, as the loss of workloads does not affect user sessions. However, for stateful workloads (such as Flink), the loss of workloads may result in the loss of session data, preventing the workload from recovering correctly.
Member

What's the session data thing here?

Member Author
@XiShanYongYe-Chang Jul 16, 2025

I lifted it from #5116.

It may be:

  • User Session Information: This includes data such as the user's login status, session identifiers, session start time, and other session-related metadata.
  • Temporary Data: Data that the application generates while processing user requests, which needs to be consistent across multiple requests.
  • State Information: The state that the application maintains during its operation, such as intermediate computation results, checkpoints, and other runtime data.

Member Author
@XiShanYongYe-Chang Jul 16, 2025

Would it be easier to understand if we used "persistent data" instead?

Contributor

How about:

The advantage of stateless workloads is their higher fault tolerance, as the loss of workloads does not impact application correctness or require data recovery. In contrast, stateful workloads (like Flink) depend on checkpointed or saved state to recover after failures. If that state is lost, the job may not be able to resume from where it left off.

Member Author

Thanks @mszacillo


The advantage of stateless workloads is their higher fault tolerance, as the loss of workloads does not affect user sessions. However, for stateful workloads (such as Flink), the loss of workloads may result in the loss of session data, preventing the workload from recovering correctly.

For workloads, failover scenarios include two cases: the first is when the workload itself fails and needs to be migrated to another cluster; the second is when the cluster encounters a failure, such as a network partition, requiring some workloads within the cluster to be migrated.
Member

Does "network partition" mean that the network of a specific cluster is blocked?
I doubt that is a typical scenario that should trigger cluster failover.

Member Author

"Network partition" is an established term; we can refer to the explanation provided by Wikipedia.

Let me provide a few more typical scenarios:

e.g., node crashes, network partitions, or control plane outages


Karmada currently supports propagating various types of resources, including Kubernetes objects and CRDs, which is particularly useful for ensuring the resilience of stateful workloads in multi-cluster environments. However, the Karmada system's processing logic is based on the assumption that the propagated resources are stateless. In failover scenarios, users may want to preserve a certain state before failure so that workloads can resume from the point where they stopped in the previous cluster.

For CRDs that handle data processing (such as Flink or Spark), restarting the application from a previous checkpoint can be particularly useful. This allows the application to seamlessly resume processing from the point of interruption while avoiding duplicate processing. Karmada v1.12 supports stateful workload migration after application failures, and the next step is to extend support for stateful workload migration following cluster failures.
Member

CRD can't handle data processing...


## Motivation

Karmada currently supports propagating various types of resources, including Kubernetes objects and CRDs, which is particularly useful for ensuring the resilience of stateful workloads in multi-cluster environments. However, the Karmada system's processing logic is based on the assumption that the propagated resources are stateless. In failover scenarios, users may want to preserve a certain state before failure so that workloads can resume from the point where they stopped in the previous cluster.
Member

Karmada currently supports propagating various types of resources, including Kubernetes objects and CRDs, which is particularly useful for ensuring the resilience of stateful workloads in multi-cluster environments.

I don't get why supporting various resources is useful for stateful workloads.

Member Author

I lifted it from #5116.

Let me revise the wording.

Comment on lines 225 to 229
What situations may cause the feature to fail, and will it cause system crashes?

As mentioned above, this proposal focuses on scenarios where all replicas of a stateful application are migrated together. If users split the replicas of a stateful application and propagate them across different clusters, and the system migrates only some replicas, the stateful application failover recovery feature may not work effectively.

This feature will not cause system crashes.
Member

We won't accept any feature that would cause the system to crash; that's common sense, so it's not necessary to explain.

As mentioned above, this proposal focuses on scenarios where all replicas of a stateful application are migrated together. If users split the replicas of a stateful application and propagate them across different clusters, and the system migrates only some replicas, the stateful application failover recovery feature may not work effectively.

  1. This is a limitation or constraint of this feature; I would prefer to put it under Notes/Constraints/Caveats (Optional).
  2. Another question is: can the system identify whether a stateful workload can be split across clusters, considering the use case of propagating etcd replicas to multiple clusters?

Member Author

This is a limitation or constraint of this feature; I would prefer to put it under Notes/Constraints/Caveats (Optional).

+1

Another question is: can the system identify whether a stateful workload can be split across clusters, considering the use case of propagating etcd replicas to multiple clusters?

Sorry, I didn't quite understand the question.

Comment on lines 235 to 236
- The `StatePreservation` definition in the `PropagationPolicy` API will transition to Beta;
- The `PreservedLabelState` and `ClustersBeforeFailover` fields in the `ResourceBinding` API will transition to Beta;
Member

When speaking of graduation criteria, we are talking about the maturity of a feature, not an API field...


- The `StatePreservation` definition in the `PropagationPolicy` API will transition to Beta;
- The `PreservedLabelState` and `ClustersBeforeFailover` fields in the `ResourceBinding` API will transition to Beta;
- End-to-end (E2E) testing has been completed.
Member

How many E2E tests are planned for this feature? I mean, how do we know that the E2E testing is complete?

Member Author

Supplemented in the Test Planning section.

Comment on lines 172 to 216
In this proposal, we have uncommented the `Cluster` field in the `FailoverBehavior` struct to define behaviors related to cluster failover. We have also added the `StatePreservation` field to the `ClusterFailoverBehavior` struct, which is used to specify the policy for preserving and restoring state data of stateful applications during failover events. The type of this field is exactly the same as the `StatePreservation` field in the `ApplicationFailoverBehavior` struct.

The `ResourceBinding` API already meets the design goals and can directly reuse the `PreservedLabelState` and `ClustersBeforeFailover` fields in the `gracefulEvictionTask` without introducing new modifications.
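For readers who want a concrete picture, below is a rough sketch of the types described above. It is modeled on the existing ApplicationFailoverBehavior and StatePreservation definitions in the policy/v1alpha1 package; the comments, the ApplicationFailoverBehavior stub, and any details not named in the proposal text are illustrative assumptions rather than the final API.

```go
// Package v1alpha1 -- illustrative sketch only, not the authoritative Karmada API.
package v1alpha1

// FailoverBehavior indicates failover behaviors in case of an application or cluster failure.
type FailoverBehavior struct {
	// Application indicates failover behaviors in case of application failure.
	// +optional
	Application *ApplicationFailoverBehavior `json:"application,omitempty"`

	// Cluster indicates failover behaviors in case of cluster failure.
	// This proposal enables this previously commented-out field.
	// +optional
	Cluster *ClusterFailoverBehavior `json:"cluster,omitempty"`
}

// ApplicationFailoverBehavior indicates application failover behaviors.
// Other existing fields are omitted from this sketch.
type ApplicationFailoverBehavior struct {
	// StatePreservation defines the policy for preserving and restoring state data
	// during failover events for stateful applications.
	// +optional
	StatePreservation *StatePreservation `json:"statePreservation,omitempty"`
}

// ClusterFailoverBehavior indicates cluster failover behaviors.
type ClusterFailoverBehavior struct {
	// StatePreservation reuses exactly the same type as
	// ApplicationFailoverBehavior.StatePreservation.
	// Additional fields (e.g. a PurgeMode) are still being discussed in this review.
	// +optional
	StatePreservation *StatePreservation `json:"statePreservation,omitempty"`
}

// StatePreservation defines the policy for preserving and restoring state data
// during failover events for stateful applications.
type StatePreservation struct {
	// Rules contains a list of StatePreservationRule configurations.
	// +required
	Rules []StatePreservationRule `json:"rules"`
}

// StatePreservationRule defines a single state preservation rule.
type StatePreservationRule struct {
	// AliasLabelName is the label key under which the preserved value is re-injected
	// into the resource after migration (e.g. carrying a Flink jobId).
	// +required
	AliasLabelName string `json:"aliasLabelName"`

	// JSONPath is the JSONPath expression, evaluated against the resource's status,
	// that selects the piece of state to preserve.
	// +required
	JSONPath string `json:"jsonPath"`
}
```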
Member

I don't see where the PurgeMode thing is explained.


// Cluster indicates failover behaviors in case of cluster failure.
// +optional
Cluster *ClusterFailoverBehavior `json:"cluster,omitempty"`
Contributor

Apologies for the delay here, I have been traveling recently.

I don’t have strong objections to introducing ClusterFailoverBehavior as a separate field. That said, from my perspective, do we expect users to define different state preservation between cluster and application failover?

Right now, we're running on an internal commit that simply reuses the same state preservation defined under FailoverBehavior during cluster failover events, and that’s worked well so far.

I think defining a separate ClusterFailoverBehavior makes more sense once the cluster failover logic is more fully developed — for example, if users can define custom cluster failover criteria. At the moment, including state preservation options in both places feels somewhat redundant. But I also understand if this is just a first step towards having a more formally defined ClusterFailover feature, so maybe we can address this later.

Member Author

Thanks @mszacillo

Apologies for the delay here, I have been traveling recently.

It sounds wonderful. 👍

At the moment, including state preservation options in both places feels somewhat redundant. But I also understand if this is just a first step towards having a more formally defined ClusterFailover feature, so maybe we can address this later.

@RainbowMango and I have also discussed this issue. Currently, defining StatePreservation in both places does seem somewhat redundant. However, we are also considering another point: we may not yet have sufficient scenarios to prove that the way state information is obtained during migrations triggered by cluster failures is exactly the same as for migrations triggered by application failures.

Therefore, this redundancy might be an intermediate state that will be eliminated in the future; it could also be validated by new scenarios and become the final state.

I would also like to understand whether this redundant design would be easy to adopt in the solution you are currently running.

Member

However, we are also considering another point: we may not yet have sufficient scenarios to prove that the way state information is obtained during migrations triggered by cluster failures is exactly the same as for migrations triggered by application failures.

Yes, that's exactly the reason. We don't know whether the requirements are exactly the same between application failover and cluster failover. Maybe they require different labels to identify the reason.

@XiShanYongYe-Chang This is a good question, and I think we should explain it in this proposal.

@XiShanYongYe-Chang force-pushed the support-stateful-cluster-failover branch from 9510e81 to f6a6f41 on July 17, 2025
@XiShanYongYe-Chang
Member Author

Hi @RainbowMango @mszacillo, an updated version has been pushed based on the comments. Can you help review it again?

@XiShanYongYe-Chang force-pushed the support-stateful-cluster-failover branch 2 times, most recently from ffff931 to 46a19b5 on July 17, 2025
@XiShanYongYe-Chang
Member Author

Hi @kevin-wangzefeng, can you help take a look?
/cc @kevin-wangzefeng

@XiShanYongYe-Chang force-pushed the support-stateful-cluster-failover branch from 46a19b5 to f80c613 on August 2, 2025
Labels
kind/documentation: Categorizes issue or PR as related to documentation.
kind/feature: Categorizes issue or PR as related to a new feature.
size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.
5 participants