
Conversation

heschlie
Contributor

Description

Fixes projectcalico/calico#1057

When we start a cluster that is missing some of our CRDs, or delete some of our CRDs from a running cluster, the syncer goes into a loop attempting to rewatch those resources and failing because they don't exist. At some point during that process a resync is triggered, at the end of which we destroy the watchers and then immediately rewatch the resources. This creates a lot of churn in the client-go library that seems to fragment memory, which then never gets cleaned up, and our footprint grows by ~2MB/minute. Also, felix never seems to release the memory once the leak has happened: adding the CRD back stopped the leaking, but no memory was freed even after a few hours.

I removed the code that destroys the watchers at the end of the resync, ran a cluster overnight with a missing CRD, and did not see any leaking occur. I am reasonably sure this should be OK, but I would like some more insight into why we decided to destroy the watchers at the end of the resync. The comment seems to indicate we might leak some resources when we (re)start watching, but each watch first checks whether we're already watching that resource before trying to start.
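To illustrate the "already watching" guard described above, here is a minimal stdlib sketch; the names (`syncer`, `openWatchers`, `startWatch`) are illustrative, not the actual felix/libcalico-go code:

```go
package main

import (
	"fmt"
	"sync"
)

// watcher is a stand-in for a client-go watch.Interface.
type watcher struct{ stop chan struct{} }

func (w *watcher) Stop() { close(w.stop) }

// syncer models per-resource watcher bookkeeping (illustrative only).
type syncer struct {
	mu           sync.Mutex
	openWatchers map[string]*watcher
}

// startWatch opens a watch only if one is not already running for the
// resource, so repeated calls do not leak a second watcher.
func (s *syncer) startWatch(resource string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.openWatchers[resource]; ok {
		return false // already watching; nothing new is created
	}
	s.openWatchers[resource] = &watcher{stop: make(chan struct{})}
	return true
}

func main() {
	s := &syncer{openWatchers: map[string]*watcher{}}
	fmt.Println(s.startWatch("GlobalNetworkPolicy")) // new watch started
	fmt.Println(s.startWatch("GlobalNetworkPolicy")) // guard refuses a duplicate
}
```

With a guard like this in place, leaving the old watchers open across a resync should not multiply them, which is why removing the destroy step looks safe.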

I have some pprof dumps from felix I can share as well to show master with CRD, master with CRD deleted, and this fix with CRD deleted.

Todos

  • Tests
  • Documentation
  • Release note

Release Note

None required

@heschlie heschlie added this to the Calico v2.6.0 milestone Sep 22, 2017
@fasaxc
Member

fasaxc commented Sep 22, 2017

@heschlie I think we need to reopen the watchers for correctness too. Doing the snapshot may have put the watchers out of sync; for example, we may be watching from event 1000, but the snapshot just updated us to event 1100, so the event stream that comes out of the watcher will be 1001, 1002, ... If we apply event 1001 on top of snapshot 1100 then we may end up inconsistent. I don't think the KDD snapshot/watcher resolution logic is set up to handle that case.

I wonder if we're leaking goroutines. Maybe calling Stop() on the watcher isn't enough to shut it down correctly. (Although its interface clearly states that it should.)
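The consistency concern above can be made concrete with a small stdlib sketch: an event carrying an older revision than the snapshot must not be applied on top of it. The types and numbers here are illustrative, following the 1000/1100 example, not the real KDD resolution code:

```go
package main

import "fmt"

// event models a watch event carrying a resourceVersion-style revision.
type event struct {
	rev   int
	value string
}

// applyIfNewer applies an event on top of a snapshot only if the event's
// revision is newer than the snapshot's; otherwise the cache would be
// rolled back to state the snapshot has already superseded.
func applyIfNewer(snapshotRev int, state map[string]string, e event) bool {
	if e.rev <= snapshotRev {
		return false // stale: the snapshot already includes this change
	}
	state["key"] = e.value
	return true
}

func main() {
	state := map[string]string{"key": "from-snapshot-1100"}
	// The watch was started at rev 1000; the snapshot relisted at rev 1100.
	fmt.Println(applyIfNewer(1100, state, event{1001, "stale"})) // must be dropped
	fmt.Println(applyIfNewer(1100, state, event{1101, "fresh"})) // safe to apply
}
```

Restarting the watch from the snapshot's revision sidesteps the problem entirely, because the watcher then only ever emits events newer than the snapshot.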

@fasaxc fasaxc requested a review from nelljerram September 22, 2017 10:30
@fasaxc
Member

fasaxc commented Sep 22, 2017

@neiljerram and I chatted, we think my thoughts above might be out of date since he implemented selective resync.

@nelljerram
Member

would like some more insight as to why we decided to destroy them at the end of the resync

@heschlie I think the relevant PRs here are #420 and #437.

The situation after #420 was that if KDD thought it was in any kind of bother, it resynced (i.e. relisted) all of the resources and restarted all of the watchers. At that point there was no distinction between conditions which needed resync and conditions that needed a watch restart; and no independent handling for different resources. The #420 change specifically (by @fasaxc) was to add that closeAllWatchers line, each time a full resync happened, because otherwise the complete previous set of watchers was leaked.

Then, in #437 (by me), we realized that

  • some conditions only need restarting the watch, and not relisting; specifically (IIRC) when a watch ends because the ResourceVersion has incremented by more than 1000 since the watch started

  • if we need to relist or restart the watch for one resource, we may not need to do that for all resources.

So I added independent handling for each resource, and restarting watches without relisting. In that change, I left the closeAllWatchers line logically as it was - but now suspect that that was wrong, and that this PR is correct to remove that line.

Certainly, if we've hit a problem with resource A, we should not need to restart the watch for a different resource B - but clearly that is the effect of the existing closeAllWatchers line, and so it's better to remove it.

The only worry is: we would then have cases where, for a resource A, we relist that resource but do not restart the watch - which may be invalid, as @fasaxc says in his comment above. It feels safer to restart the watch whenever we relist (which is in practice what we have been doing until now, because of the overzealous closeAllWatchers line).

If that is right, some further code changes would be appropriate:

  • at the start of the loop, for each resource for which syn.needsResync is true, call syn.closeWatcher

  • the places below that currently do

    syn.needsResync[X] = true
    syn.closeWatcher(X)
    

    can be simplified to just

    syn.needsResync[X] = true
    
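The suggestion above can be sketched with a minimal stdlib model: at the start of the loop, close the watcher only for resources whose needsResync flag is set, instead of closing all watchers. All names (`syncer`, `resyncLoopIteration`, the map-based watcher bookkeeping) are illustrative, not the actual libcalico-go code:

```go
package main

import "fmt"

// syncer models per-resource resync state (illustrative only).
type syncer struct {
	needsResync map[string]bool
	watchers    map[string]bool // present => watcher currently open
}

func (s *syncer) closeWatcher(r string) { delete(s.watchers, r) }

// resyncLoopIteration closes the watcher for each resource that needs a
// resync, so relist and watch restart always happen together, while
// watchers for healthy resources are left untouched.
func (s *syncer) resyncLoopIteration() []string {
	var relisted []string
	for r, needed := range s.needsResync {
		if !needed {
			continue
		}
		s.closeWatcher(r) // restart the watch whenever we relist
		relisted = append(relisted, r)
		s.needsResync[r] = false
	}
	return relisted
}

func main() {
	s := &syncer{
		needsResync: map[string]bool{"A": true, "B": false},
		watchers:    map[string]bool{"A": true, "B": true},
	}
	s.resyncLoopIteration()
	fmt.Println(s.watchers["A"], s.watchers["B"]) // A closed, B untouched
}
```

This keeps @fasaxc's invariant (never relist without restarting the watch) while dropping the overzealous close of unrelated watchers.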

@nelljerram
Member

Sorry, an update/clarification of the previous comment; I took too much credit...

It was PR #433 (by @caseydavenport) that introduced the resyncing of resources individually. (And so the delta of #437 was only for the conditions where we can re-watch without relisting.) The tests in #433 are quite explicit about expecting that 1 resource resync => 7 new watch calls; so I'm a little worried that there may actually be some reason why that is needed. It would be best to check with Casey if he is available.

@heschlie
Contributor Author

@fasaxc @neiljerram Thanks a ton for the detailed feedback! I'll discuss further with @caseydavenport today and see what is/might be needed. I had figured there would be more to this and wanted to get some extra eyes on it.

@caseydavenport
Member

@neiljerram thanks for the nice analysis.

That closeAllWatchers call is overzealous - I don't think there is a need to close ALL watchers every time. That was likely left there in the last PR because it was easier and seemed like a safe-if-unnecessary precaution. I don't recall any particular reason that needed to be there.

@neiljerram's suggestion above sounds like the right thing to do to me.

@heschlie
Contributor Author

Currently running this commit against a cluster missing the GlobalNetworkPolicy CRD to see if it has stopped the leak.

@heschlie
Contributor Author

Memory footprint seems stable after 3.5 hours (staying sub 30MB).

@heschlie
Contributor Author

@fasaxc @neiljerram @caseydavenport I think this is ready for some review.

Member

@nelljerram nelljerram left a comment


Looks great - thanks.

Member

@caseydavenport caseydavenport left a comment


LGTM - Thanks @heschlie!

@caseydavenport
Member

@heschlie could you squash the commits?

When we create a cluster without our CRDs, or remove one from a
running cluster, the syncer starts retrying to watch the non-existent
resource. During this loop we "resync" and then destroy the old
watchers. This process kicks off the leak, which stems from somewhere in
client-go, possibly from memory fragmented by rapidly creating and
destroying the watches and their underlying channels.

We now only close watchers whose resources needed a resync; this
prevents us from restarting watches that don't need to be stopped.
@heschlie heschlie merged commit 71101ce into projectcalico:master Sep 28, 2017
song-jiang pushed a commit to song-jiang/libcalico-go that referenced this pull request Jul 19, 2021
Send connection failed before sending in-sync
Development

Successfully merging this pull request may close these issues.

Memory leak in calico-felix (KDD Syncer)
4 participants