Skip to content

disable deepcopy for list action #5813

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 21, 2025

Conversation

CharlesQQ
Copy link
Member

@CharlesQQ CharlesQQ commented Nov 13, 2024

What type of PR is this?
/kind feature

What this PR does / why we need it:

When the node CPU usage rate is almost over 90%, it seems that deepcopy cost the most time:

I1113 14:36:11.161745      18 overridemanager.go:181] applyNamespacedOverrides, list configmap-beta-moa-demo-prod-681,  cluster member-cluster1, cost: 0.980144234
I1113 14:36:12.445315      18 overridemanager.go:181] applyNamespacedOverrides, list configmap-beta-moa-demo-prod-681,  cluster member-cluster2, cost: 0.811084427
image

After turning off list deepcopy, the list time is reduced to 0.1s or even lower.

I1113 15:43:39.212238      17 overridemanager.go:182] applyNamespacedOverrides, list beta-moa-demo-prod-283,  cluster member-cluster1, cost: 0.040210384
I1113 15:43:39.251189      17 overridemanager.go:182] applyNamespacedOverrides, list beta-moa-demo-prod-97,  cluster member-cluster2, cost: 0.018000362
I1113 15:43:39.314731      17 overridemanager.go:182] applyNamespacedOverrides, list beta-moa-demo-prod-283,  cluster member-cluster2, cost: 0.023638913
image

Which issue(s) this PR fixes:
part of #5790

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

`Performance`: `karmada-controller-manager`: Significant performance improvements have been achieved by reducing deepcopy operations during list processes:
- ​​Binding Controller​​: Response time reduced by 50%
​​- Dependencies Controller​​: Execution time improved 5× faster 
​​- Detector Controller​​: Processing time cut by 50% 

@karmada-bot karmada-bot added the kind/feature Categorizes issue or PR as related to a new feature. label Nov 13, 2024
@karmada-bot karmada-bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Nov 13, 2024
@codecov-commenter
Copy link

codecov-commenter commented Nov 13, 2024

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 90.69767% with 4 lines in your changes missing coverage. Please review.

Project coverage is 48.15%. Comparing base (b3f93ec) to head (d58cbe1).

Files with missing lines Patch % Lines
pkg/util/overridemanager/overridemanager.go 0.00% 0 Missing and 2 partials ⚠️
pkg/controllers/binding/binding_controller.go 91.66% 0 Missing and 1 partial ⚠️
...ers/binding/cluster_resource_binding_controller.go 66.66% 0 Missing and 1 partial ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5813      +/-   ##
==========================================
+ Coverage   48.14%   48.15%   +0.01%     
==========================================
  Files         677      677              
  Lines       56048    56068      +20     
==========================================
+ Hits        26982    26999      +17     
- Misses      27296    27298       +2     
- Partials     1770     1771       +1     
Flag Coverage Δ
unittests 48.15% <90.69%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@karmada-bot karmada-bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Nov 22, 2024
@zach593
Copy link
Contributor

zach593 commented Feb 22, 2025

Regarding this performance optimization, which clearly introduces side effects, I have two suggestions:

  1. How much of a performance improvement will it bring? Could you add benchmarks for these calls along with e2e performance comparison?

  2. Is it possible to add some mechanisms to mitigate these side effects? For example, adding unit tests or e2e tests so that if someone later introduces code that triggers this side effect (i.e., directly modifying the informer cache), our test suite can detect such behavior.

@CharlesQQ
Copy link
Member Author

How much of a performance improvement will it bring? Could you add benchmarks for these calls along with e2e performance comparison?

Through golang pprof analysis, it is found that during the restart period, deep copy will have obvious CPU time loss. After removing it, this part of the loss disappeared.
Maybe this change is difficult to compare through e2e testing. Do you have any good testing suggestions?

@zach593
Copy link
Contributor

zach593 commented Feb 24, 2025

Through golang pprof analysis, it is found that during the restart period, deep copy will have obvious CPU time loss. After removing it, this part of the loss disappeared.

This part is what I mean by e2e performance comparison, I'm not saying e2e tests.

@CharlesQQ
Copy link
Member Author

/retest

@CharlesQQ CharlesQQ changed the title feat(binding): disable deepcopy for list action disable deepcopy for list action Mar 7, 2025
@CharlesQQ
Copy link
Member Author

pls take a look at the e2e performance comparison @zach593

@zach593
Copy link
Contributor

zach593 commented Mar 9, 2025

Regarding this pprof result, what was the resource level at that time? When the controller is cold started, without or with this commit, how long does it take to clear the queue? How long does a single reconciliation take?

Reducing the time taken for an operation by an order of magnitude is great, but I think it would be more helpful to see how much of an overall improvement the change made, especially for an optimization that has potential side effects.

@zach593
Copy link
Contributor

zach593 commented Mar 11, 2025

2. Is it possible to add some mechanisms to mitigate these side effects? For example, adding unit tests or e2e tests so that if someone later introduces code that triggers this side effect (i.e., directly modifying the informer cache), our test suite can detect such behavior.

There is an env KUBE_CACHE_MUTATION_DETECTOR, which may cause panic when the informer cache is illegally modified, I think if we can enable this environment when running e2e tests, I guess it could eliminate the side effects of disable deepcopy.

@CharlesQQ
Copy link
Member Author

Reducing the time taken for an operation by an order of magnitude is great, but I think it would be more helpful to see how much of an overall improvement the change made, especially for an optimization that has potential side effects.

I will test in a real test environment and give optimization data

@CharlesQQ
Copy link
Member Author

CharlesQQ commented Mar 20, 2025

here is an env KUBE_CACHE_MUTATION_DETECTOR, which may cause panic when the informer cache is illegally modified, I think if we can enable this environment when running e2e tests, I guess it could eliminate the side effects of disable deepcopy.

KUBE_CACHE_MUTATION_DETECTOR=true take effect in go-client sharedinformer, does it also work in controller-runtime? If so, this environment variable should be added in e2e tests

@zach593
Copy link
Contributor

zach593 commented Mar 20, 2025

KUBE_CACHE_MUTATION_DETECTOR=true take effect in go-client sharedinformer, does it also work in controller-runtime?

func NewInformers(config *rest.Config, options *InformersOpts) *Informers {
newInformer := cache.NewSharedIndexInformer
if options.NewInformer != nil {

The answer is yes, it also applies to controller-runtime, because controller-runtime also uses client-go informer.

@zach593
Copy link
Contributor

zach593 commented Mar 27, 2025

A new concern just popped into my mind: does disabling deepCopy consistently lead to data being in a state of race conditions?

@CharlesQQ
Copy link
Member Author

A new concern just popped into my mind: does disabling deepCopy consistently lead to data being in a state of race conditions?

I don't understand what this sentence means, can you give me an example?

@CharlesQQ
Copy link
Member Author

CharlesQQ commented Mar 31, 2025

4000 workload, 8000 rb, 12C cpu:

for binding-controller, before optimization, it tooks 9 minutes; after optimization, it takes 4 minutes and 30 seconds

image

for dependencies resource detector, before optimization, it tooks 7 minutes and 15 seconds; after optimization, it takes 1 minutes and 30 seconds

image

@CharlesQQ
Copy link
Member Author

/retest

@zach593
Copy link
Contributor

zach593 commented Mar 31, 2025

I'm wondering if there will be a data race between the reading thread and the informer writing thread. This is something I suddenly realized, and I didn't think about it very seriously. Since I raised this question, I did some research and the result is it should not cause a data race.

// from oldest to newest
for _, d := range deltas {
obj := d.Object
switch d.Type {
case Sync, Replaced, Added, Updated:
if old, exists, err := clientState.Get(obj); err == nil && exists {
if err := clientState.Update(obj); err != nil {
return err
}

From the code above, you can see that when writing to the cache, the entire variable is replaced with the new object from the deltas . If the reading thread holds an object, when the writing thread updates the cache, the object held by the reading thread and the cache will no longer have any connection.

In other words, if the reading thread modifies the held object in the *a.xxx = yyy way, then this will cause a data race, and will also pollute the cache and objects already held by other reading threads. The object read from the cache later will also be polluted. Pollution will continue until the informer updates the object.

@CharlesQQ
Copy link
Member Author

CharlesQQ commented Apr 1, 2025

In other words, if the reading thread modifies the held object in the *a.xxx = yyy way, then this will cause a data race, and will also pollute the cache and objects already held by other reading threads. The object read from the cache later will also be polluted. Pollution will continue until the informer updates the object.

list action disabling deepCopy, must make sure the list item will not be updated, or make a deepcopy of it before updating it.

@karmada-bot karmada-bot removed the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Apr 14, 2025
@karmada-bot karmada-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 14, 2025
@CharlesQQ CharlesQQ force-pushed the disdabledeepcopy branch 2 times, most recently from 707cb18 to d62d6df Compare April 14, 2025 11:15
@CharlesQQ
Copy link
Member Author

CharlesQQ commented Apr 14, 2025

4000 workload, 12C cpu, when restarting

detector controller: takes 2 minutes and 30 seconds
image

after disable deepcopy for list action, it takes 1 minute and 15 seconds
image

Copy link
Member

@RainbowMango RainbowMango left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Disable deepcopy indeed can improve performance, but should be used very cautiously.

/assign

Copy link
Member

@RainbowMango RainbowMango left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good to me.
But I think we should rename variable name or add more comments to notice the object can not be mutated.

@@ -190,7 +191,7 @@ func (c *ResourceBindingController) newOverridePolicyFunc() handler.MapFunc {
}

bindingList := &workv1alpha2.ResourceBindingList{}
if err := c.Client.List(ctx, bindingList); err != nil {
if err := c.Client.List(ctx, bindingList, &client.ListOptions{UnsafeDisableDeepCopy: ptr.To(true)}); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to add a namespace filter? Like:

                bindingList := &workv1alpha2.ResourceBindingList{}
-               if err := c.Client.List(ctx, bindingList); err != nil {
+               if err := c.Client.List(ctx, bindingList, &client.ListOptions{UnsafeDisableDeepCopy: ptr.To(true), Namespace: namespace}); err != nil {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bindingList --> readonlyBindingList?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to add a namespace filter

Yes

@CharlesQQ
Copy link
Member Author

But I think we should rename variable name or add more comments to notice the object can not be mutated.

Controller-runtime has a detailed description of UnsafeDisableDeepCopy, which is very clear. Is there no need to add it?

// UnsafeDisableDeepCopy indicates not to deep copy objects during list objects.
// Be very careful with this, when enabled you must DeepCopy any object before mutating it,
// otherwise you will mutate the object in the cache.
// +optional
UnsafeDisableDeepCopy *bool

Copy link
Member

@RainbowMango RainbowMango left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@karmada-bot karmada-bot added the lgtm Indicates that a PR is ready to be merged. label Apr 21, 2025
@karmada-bot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: RainbowMango

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karmada-bot karmada-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 21, 2025
@karmada-bot karmada-bot merged commit 87ee756 into karmada-io:master Apr 21, 2025
24 checks passed
@RainbowMango RainbowMango added this to the v1.14 milestone Aug 2, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/feature Categorizes issue or PR as related to a new feature. lgtm Indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants