
Conversation

heschlie
Contributor

Description

Fixes projectcalico/calico#1057

When we start a cluster that is missing some of our CRDs, or delete some of our CRDs from a running cluster, the syncer goes into a loop attempting to rewatch those resources and failing because they don't exist. At some point during that process a resync is triggered, at the end of which we destroy the watchers and then immediately rewatch the resources. This creates a lot of churn in the client-go library that seems to fragment memory, which then never gets cleaned up, and our footprint grows by ~2MB/minute. Also, felix never seems to release the memory once the leak has happened: adding the CRD back stopped the leaking, but no memory was freed even after a few hours.

I removed the code that destroys the watchers at the end of the resync, ran a cluster overnight with a missing CRD, and did not see any leaking occur. I am reasonably sure this should be OK, but I would like some more insight into why we decided to destroy the watchers at the end of the resync. The comment seems to indicate we might leak some resources when we (re)start watching, but each watch first checks whether we're already watching that resource before trying to start.
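To illustrate the "already watching" guard described above, here is a minimal stdlib sketch; the names (`syncer`, `openWatchers`, `startWatch`) are illustrative, not the actual felix/libcalico-go code:

```go
package main

import (
	"fmt"
	"sync"
)

// watcher is a stand-in for a client-go watch.Interface.
type watcher struct{ stop chan struct{} }

func (w *watcher) Stop() { close(w.stop) }

// syncer models per-resource watcher bookkeeping (illustrative only).
type syncer struct {
	mu           sync.Mutex
	openWatchers map[string]*watcher
}

// startWatch opens a watch only if one is not already running for the
// resource, so repeated calls do not leak a second watcher.
func (s *syncer) startWatch(resource string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.openWatchers[resource]; ok {
		return false // already watching; nothing new is created
	}
	s.openWatchers[resource] = &watcher{stop: make(chan struct{})}
	return true
}

func main() {
	s := &syncer{openWatchers: map[string]*watcher{}}
	fmt.Println(s.startWatch("GlobalNetworkPolicy")) // new watch started
	fmt.Println(s.startWatch("GlobalNetworkPolicy")) // guard refuses a duplicate
}
```

With a guard like this in place, leaving the old watchers open across a resync should not multiply them, which is why removing the destroy step looks safe.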

I have some pprof dumps from felix I can share as well to show master with CRD, master with CRD deleted, and this fix with CRD deleted.

Todos

  • Tests
  • Documentation
  • Release note

Release Note

None required

@heschlie heschlie added this to the Calico v2.6.0 milestone Sep 22, 2017
@fasaxc
Member

fasaxc commented Sep 22, 2017

@heschlie I think we need to reopen the watchers for correctness too. Doing the snapshot may have put the watchers out of sync; for example, we may be watching from event 1000, but the snapshot just updated us to event 1100, so the event stream that comes out of the watcher will be 1001, 1002, ... If we apply event 1001 on top of snapshot 1100 then we may end up inconsistent. I don't think the KDD snapshot/watcher resolution logic is set up to handle that case.

I wonder if we're leaking goroutines. Maybe calling Stop() on the watcher isn't enough to shut it down correctly. (Although its interface clearly states that it should.)
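The consistency concern above can be made concrete with a small stdlib sketch: an event carrying an older revision than the snapshot must not be applied on top of it. The types and numbers here are illustrative, following the 1000/1100 example, not the real KDD resolution code:

```go
package main

import "fmt"

// event models a watch event carrying a resourceVersion-style revision.
type event struct {
	rev   int
	value string
}

// applyIfNewer applies an event on top of a snapshot only if the event's
// revision is newer than the snapshot's; otherwise the cache would be
// rolled back to state the snapshot has already superseded.
func applyIfNewer(snapshotRev int, state map[string]string, e event) bool {
	if e.rev <= snapshotRev {
		return false // stale: the snapshot already includes this change
	}
	state["key"] = e.value
	return true
}

func main() {
	state := map[string]string{"key": "from-snapshot-1100"}
	// The watch was started at rev 1000; the snapshot relisted at rev 1100.
	fmt.Println(applyIfNewer(1100, state, event{1001, "stale"})) // must be dropped
	fmt.Println(applyIfNewer(1100, state, event{1101, "fresh"})) // safe to apply
}
```

Restarting the watch from the snapshot's revision sidesteps the problem entirely, because the watcher then only ever emits events newer than the snapshot.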

@fasaxc fasaxc requested a review from nelljerram September 22, 2017 10:30
@fasaxc
Member

fasaxc commented Sep 22, 2017

@neiljerram and I chatted, we think my thoughts above might be out of date since he implemented selective resync.

@nelljerram
Member

would like some more insight as to why we decided to destroy them at the end of the resync

@heschlie I think the relevant PRs here are #420 and #437.

The situation after #420 was that if KDD thought it was in any kind of bother, it resynced (i.e. relisted) all of the resources and restarted all of the watchers. At that point there was no distinction between conditions which needed resync and conditions that needed a watch restart; and no independent handling for different resources. The #420 change specifically (by @fasaxc) was to add that closeAllWatchers line, each time a full resync happened, because otherwise the complete previous set of watchers was leaked.

Then, in #437 (by me), we realized that

  • some conditions only need restarting the watch, and not relisting; specifically (IIRC) when a watch ends because the ResourceVersion has incremented by more than 1000 since the watch started

  • if we need to relist or restart the watch for one resource, we may not need to do that for all resources.

So I added independent handling for each resource, and restarting watches without relisting. In that change, I left the closeAllWatchers line logically as it was - but now suspect that that was wrong, and that this PR is correct to remove that line.

Certainly, if we've hit a problem with resource A, we should not need to restart the watch for a different resource B - but clearly that is the effect of the existing closeAllWatchers line, and so it's better to remove it.

The only worry is: we would then have cases where, for a resource A, we relist that resource but do not restart the watch - which may be invalid, as @fasaxc says in his comment above. It feels safer to restart the watch whenever we relist (which is in practice what we have been doing until now, because of the overzealous closeAllWatchers line).

If that is right, some further code changes would be appropriate:

  • at the start of the loop, for each resource for which syn.needsResync is true, call syn.closeWatcher

  • the places below that currently do

    syn.needsResync[X] = true
    syn.closeWatcher(X)
    

    can be simplified to just

    syn.needsResync[X] = true
    
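The suggestion above can be sketched with a minimal stdlib model: at the start of the loop, close the watcher only for resources whose needsResync flag is set, instead of closing all watchers. All names (`syncer`, `resyncLoopIteration`, the map-based watcher bookkeeping) are illustrative, not the actual libcalico-go code:

```go
package main

import "fmt"

// syncer models per-resource resync state (illustrative only).
type syncer struct {
	needsResync map[string]bool
	watchers    map[string]bool // present => watcher currently open
}

func (s *syncer) closeWatcher(r string) { delete(s.watchers, r) }

// resyncLoopIteration closes the watcher for each resource that needs a
// resync, so relist and watch restart always happen together, while
// watchers for healthy resources are left untouched.
func (s *syncer) resyncLoopIteration() []string {
	var relisted []string
	for r, needed := range s.needsResync {
		if !needed {
			continue
		}
		s.closeWatcher(r) // restart the watch whenever we relist
		relisted = append(relisted, r)
		s.needsResync[r] = false
	}
	return relisted
}

func main() {
	s := &syncer{
		needsResync: map[string]bool{"A": true, "B": false},
		watchers:    map[string]bool{"A": true, "B": true},
	}
	s.resyncLoopIteration()
	fmt.Println(s.watchers["A"], s.watchers["B"]) // A closed, B untouched
}
```

This keeps @fasaxc's invariant (never relist without restarting the watch) while dropping the overzealous close of unrelated watchers.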

@nelljerram
Member

Sorry, an update/clarification of the previous comment; I took too much credit...

It was PR #433 (by @caseydavenport) that introduced the resyncing of resources individually. (And so the delta of #437 was only for the conditions where we can re-watch without relisting.) The tests in #433 are quite explicit about expecting that 1 resource resync => 7 new watch calls; so I'm a little worried that there may actually be some reason why that is needed. It would be best to check with Casey if he is available.

@heschlie
Contributor Author

@fasaxc @neiljerram Thanks a ton for the detailed feedback! I'll discuss further with @caseydavenport today and see what is/might be needed. I had figured there would be more to this and wanted to get some extra eyes on it.

@caseydavenport
Member

@neiljerram thanks for the nice analysis.

That closeAllWatchers call is overzealous - I don't think there is a need to close ALL watchers every time. That was likely left there in the last PR because it was easier and seemed like a safe-if-unnecessary precaution. I don't recall any particular reason that needed to be there.

@neiljerram's suggestion above sounds like the right thing to do to me.

@heschlie
Contributor Author

Currently running this commit against a cluster missing the GlobalNetworkPolicy CRD to see if it has stopped the leak.

@heschlie
Contributor Author

Memory footprint seems stable after 3.5 hours (staying sub 30MB).

@heschlie
Contributor Author

@fasaxc @neiljerram @caseydavenport I think this is ready for some review.

Member

@nelljerram nelljerram left a comment


Looks great - thanks.

Member

@caseydavenport caseydavenport left a comment


LGTM - Thanks @heschlie!

@caseydavenport
Member

@heschlie could you squash the commits?

When we create a cluster without our CRDs, or remove one from a
running cluster, the syncer starts retrying to watch the non-existent
resource. During this loop we "resync" and then destroy the old
watchers. This process kicks off the leak, which stems from somewhere in
client-go, possibly from memory fragmented by rapidly creating and
destroying the watches and their underlying channels.

We now only close watchers whose resources needed a resync; this
prevents us from restarting watches that don't need to be stopped.
@heschlie heschlie merged commit 71101ce into projectcalico:master Sep 28, 2017
song-jiang pushed a commit to song-jiang/libcalico-go that referenced this pull request Jul 19, 2021
Send connection failed before sending in-sync
Development

Successfully merging this pull request may close these issues.

Memory leak in calico-felix (KDD Syncer)
4 participants