[Proposal] Support cluster failover for stateful workloads #6495
Conversation
/cc @mszacillo
Codecov Report: ✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## master #6495 +/- ##
==========================================
- Coverage 45.34% 45.34% -0.01%
==========================================
Files 687 687
Lines 56393 56393
==========================================
- Hits 25571 25569 -2
- Misses 29226 29228 +2
Partials 1596 1596
Hi @mszacillo, any comments?
/gemini-review
Hello @RainbowMango, thank you for initiating the review with /gemini-review.
Overall, the proposal is well-structured, clearly articulates the problem, and presents a logical extension of existing functionalities within Karmada. The API modifications are minimal and seem to integrate well with the current design. Here are a few comments and suggestions based on my review:
General Observations & Questions
API Design
Testing
Overall, this is a solid proposal that addresses a critical need for stateful workloads in multi-cluster environments. Once the above points are considered, I believe this PR will be in excellent shape. Let me know if you have any questions or would like to discuss these points further.
## Summary
The advantage of stateless workloads is their higher fault tolerance, as the loss of workloads does not affect user sessions. However, for stateful workloads (such as Flink), the loss of workloads may result in the loss of session data, preventing the workload from recovering correctly.
What's the `session data` thing here?
Lifted it from #5116.
It may be:
- User Session Information: This includes data such as the user's login status, session identifiers, session start time, and other session-related metadata.
- Temporary Data: Data that the application generates while processing user requests, which needs to be consistent across multiple requests.
- State Information: The state that the application maintains during its operation, such as intermediate computation results, checkpoints, and other runtime data.
Would it be easier to understand if we use `persistent data`?
How about:
The advantage of stateless workloads is their higher fault tolerance, as the loss of workloads does not impact application correctness or require data recovery. In contrast, stateful workloads (like Flink) depend on checkpointed or saved state to recover after failures. If that state is lost, the job may not be able to resume from where it left off.
Thanks @mszacillo
For workloads, failover scenarios include two cases: the first is when the workload itself fails and needs to be migrated to another cluster; the second is when the cluster encounters a failure, such as a network partition, requiring some workloads within the cluster to be migrated.
Does `network partition` mean the network of a specific cluster is blocked?
I doubt it's a typical scenario that should trigger cluster failover.
`network partition` is a term; we can refer to the explanation provided by Wikipedia.
Let me provide a few more typical scenarios: e.g., node crashes, network partitions, or control plane outages.
For CRDs that handle data processing (such as Flink or Spark), restarting the application from a previous checkpoint can be particularly useful. This allows the application to seamlessly resume processing from the point of interruption while avoiding duplicate processing. Karmada v1.12 supports stateful workload migration after application failures, and the next step is to extend support for stateful workload migration following cluster failures.
CRD can't handle data processing...
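For readers who have not followed the v1.12 application-failover work mentioned above, the checkpoint-resume flow relies on state preservation rules that extract values from the workload's status (for example, a Flink job ID) and re-inject them as labels on the migrated resource. Below is a rough sketch of what such a rule type could look like; the field names are assumed from that feature and are not copied from this PR.

```go
// Illustrative sketch only: field names are assumed from the v1.12
// application-failover state preservation feature, not taken from this PR.

// StatePreservation defines which pieces of a workload's status should be
// preserved across a failover and re-injected into the rescheduled workload.
type StatePreservation struct {
	// Rules contains a list of preservation rules.
	Rules []StatePreservationRule `json:"rules"`
}

// StatePreservationRule preserves a single fragment of the workload's status.
type StatePreservationRule struct {
	// AliasLabelName is the label key under which the preserved value is
	// re-injected into the workload after migration.
	AliasLabelName string `json:"aliasLabelName"`
	// JSONPath selects the value from the resource status, e.g.
	// "{ .jobStatus.jobID }" for a Flink deployment, so the job can resume
	// from its last checkpoint in the new cluster.
	JSONPath string `json:"jsonPath"`
}
```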
## Motivation
Karmada currently supports propagating various types of resources, including Kubernetes objects and CRDs, which is particularly useful for ensuring the resilience of stateful workloads in multi-cluster environments. However, the Karmada system's processing logic is based on the assumption that the propagated resources are stateless. In failover scenarios, users may want to preserve a certain state before failure so that workloads can resume from the point where they stopped in the previous cluster.
Karmada currently supports propagating various types of resources, including Kubernetes objects and CRDs, which is particularly useful for ensuring the resilience of stateful workloads in multi-cluster environments.
I don't get why `supports various resources` is useful for stateful workloads.
Lifted it from #5116.
Let me revise the wording.
What situations may cause the feature to fail, and will it cause system crashes?

As mentioned above, this proposal focuses on scenarios where all replicas of a stateful application are migrated together. If users split the replicas of a stateful application and propagate them across different clusters, and the system migrates only some replicas, the stateful application failover recovery feature may not work effectively.

This feature will not cause system crashes.
We won't accept any feature that would cause the system to crash; that's common sense, so it's not necessary to explain.
As mentioned above, this proposal focuses on scenarios where all replicas of a stateful application are migrated together. If users split the replicas of a stateful application and propagate them across different clusters, and the system migrates only some replicas, the stateful application failover recovery feature may not work effectively.
- This is the limitation or constraint of this feature; I would prefer to put it in `Notes/Constraints/Caveats (Optional)`.
- Another question is, can the system identify whether a stateful workload can be split across clusters, considering the use case of propagating ETCD replicas to multiple clusters?
This is the limitation or constraint of this feature; I would prefer to put it in `Notes/Constraints/Caveats (Optional)`.
+1
Another question is, can the system identify whether a stateful workload can be split across clusters, considering the use case of propagating ETCD replicas to multiple clusters?
Sorry, I didn't quite understand the question.
- The `StatePreservation` definition in the `PropagationPolicy` API will transition to Beta;
- The `PreservedLabelState` and `ClustersBeforeFailover` fields in the `ResourceBinding` API will transition to Beta;
When speaking of graduation criteria, we are talking about the maturity of a feature, not an API field...
- The `StatePreservation` definition in the `PropagationPolicy` API will transition to Beta;
- The `PreservedLabelState` and `ClustersBeforeFailover` fields in the `ResourceBinding` API will transition to Beta;
- End-to-end (E2E) testing has been completed.
How many E2E tests are planned for this feature? I mean, how do we know the E2E testing is complete?
Supplemented in the `Test Planning` section.
In this proposal, we have uncommented the `Cluster` field in the `FailoverBehavior` struct to define behaviors related to cluster failover. We have also added the `StatePreservation` field to the `ClusterFailoverBehavior` struct, which is used to specify the policy for preserving and restoring state data of stateful applications during failover events. The type of this field is exactly the same as the `StatePreservation` field in the `ApplicationFailoverBehavior` struct.
The `ResourceBinding` API already meets the design goals and can directly reuse the `PreservedLabelState` and `ClustersBeforeFailover` fields in the `gracefulEvictionTask` without introducing new modifications. |
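For readers following the thread, here is a minimal sketch of the API shape the paragraph above describes. Only the `Cluster` field and the `StatePreservation` field are stated in the proposal text; `ApplicationFailoverBehavior` is the existing application-failover type, and the remaining names are assumptions for illustration.

```go
// Sketch of the proposed shape; not the authoritative API definition.

// FailoverBehavior groups the failover settings of a PropagationPolicy.
type FailoverBehavior struct {
	// Application indicates failover behaviors in case of application failure.
	// +optional
	Application *ApplicationFailoverBehavior `json:"application,omitempty"`

	// Cluster indicates failover behaviors in case of cluster failure.
	// Previously commented out; uncommented by this proposal.
	// +optional
	Cluster *ClusterFailoverBehavior `json:"cluster,omitempty"`
}

// ClusterFailoverBehavior indicates cluster failover behaviors.
type ClusterFailoverBehavior struct {
	// StatePreservation defines the policy for preserving and restoring state
	// data of stateful applications during cluster failover. The type is the
	// same as the StatePreservation field in ApplicationFailoverBehavior.
	// +optional
	StatePreservation *StatePreservation `json:"statePreservation,omitempty"`
}
```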
I don't see where the `PurgeMode` thing is explained.
// Cluster indicates failover behaviors in case of cluster failure.
// +optional
Cluster *ClusterFailoverBehavior `json:"cluster,omitempty"`
Apologies for the delay here, I have been traveling recently.
I don’t have strong objections to introducing ClusterFailoverBehavior as a separate field. That said, from my perspective, do we expect users to define different state preservation between cluster and application failover?
Right now, we're running on an internal commit that simply reuses the same state preservation defined under FailoverBehavior during cluster failover events, and that’s worked well so far.
I think defining a separate ClusterFailoverBehavior makes more sense once the cluster failover logic is more fully developed — for example, if users can define custom cluster failover criteria. At the moment, including state preservation options in both places feels somewhat redundant. But I also understand if this is just a first step towards having a more formally defined ClusterFailover feature, so maybe we can address this later.
Thanks @mszacillo
Apologies for the delay here, I have been traveling recently.
It sounds wonderful. 👍
At the moment, including state preservation options in both places feels somewhat redundant. But I also understand if this is just a first step towards having a more formally defined ClusterFailover feature, so maybe we can address this later.
@RainbowMango and I have also discussed this issue. Currently, defining `StatePreservation` in both places does seem somewhat redundant. However, we are also considering another point: we may not yet have sufficient scenarios to prove that the methods of obtaining information during application migration caused by cluster failures and application failures are exactly the same.
Therefore, this redundancy might be an intermediate state that will be eliminated in the future; it could also be validated by new scenarios and become the final state.
I would like to understand whether the current redundant design would be convenient to adopt for the solution you are currently running.
However, we are also considering another point: we may not yet have sufficient scenarios to prove that the methods of obtaining information during application migration caused by cluster failures and application failures are exactly the same.
Yes, it's exactly the reason. We don't know if the requirements are exactly the same between application failover and cluster failover. Maybe they require different labels to identify the reason.
@XiShanYongYe-Chang This is a good question, and I think we should explain it in this proposal.
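To make the redundancy under discussion concrete, here is a hypothetical example that builds on the sketches above (with `ApplicationFailoverBehavior` reduced to the one relevant field): the same preservation rule currently has to be declared under both the application and the cluster failover behavior. The label key and JSONPath are made-up values.

```go
// Hypothetical illustration of the duplication discussed above; label key
// and JSONPath are made-up values, not taken from this PR.

// ApplicationFailoverBehavior is reduced here to the field relevant to the
// discussion; the real type carries additional settings.
type ApplicationFailoverBehavior struct {
	// +optional
	StatePreservation *StatePreservation `json:"statePreservation,omitempty"`
}

func exampleFailoverBehavior() FailoverBehavior {
	// The same rule is declared twice, once per failover trigger.
	shared := &StatePreservation{
		Rules: []StatePreservationRule{{
			AliasLabelName: "failover.example.io/jobid",
			JSONPath:       "{ .jobStatus.jobID }",
		}},
	}
	return FailoverBehavior{
		Application: &ApplicationFailoverBehavior{StatePreservation: shared},
		Cluster:     &ClusterFailoverBehavior{StatePreservation: shared},
	}
}
```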
Hi @RainbowMango @mszacillo, an updated version has been made based on the comments. Can you help review it again?
Hi @kevin-wangzefeng, can you help take a review?
What type of PR is this?
/kind feature
/kind documentation
What this PR does / why we need it:
Support cluster failover for stateful workloads.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?: