
Conversation

@Callisto13 (Contributor) commented May 4, 2022

This fixes an issue where the clustersNamespaces and the userNamespaces
caches were not updated when Clusters were deleted.

This was most noticeable for clustersNamespaces, since the GetAll there
would return everything in the cache, stale or not.

For the userNamespaces cache this was hidden, since the GetAll in this
case would receive an up-to-date list of the cached clusters and then
use that to look up individual keys, returning them in a collated list.
This meant it looked like the cache was updated, but it was only
showing us what we wanted to see; the cache was still quietly building
up.
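
Roughly the shape of the problem, as a sketch with illustrative types (not the real cache code):

```go
package cache

import "sync"

// userNamespaces is a stand-in for the real per-cluster namespaces cache.
type userNamespaces struct {
	mu    sync.Mutex
	cache map[string][]string // cluster name -> namespaces
}

// GetAll only looks up the clusters it is handed, so entries for
// deleted clusters are never returned, but they are never evicted either.
func (u *userNamespaces) GetAll(clusters []string) map[string][]string {
	u.mu.Lock()
	defer u.mu.Unlock()

	out := map[string][]string{}
	for _, name := range clusters {
		if ns, ok := u.cache[name]; ok {
			out[name] = ns
		}
	}

	return out // stale keys stay behind in u.cache, invisible to callers
}
```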

The solution here is to generate a hash based on an ordered list of the
names of the current clusters. When updating the clustersNamespaces cache, we
compare the saved hash with the current one, and clear the clustersNamespaces if
they differ.
We also clear the userNamespaces at this point; since the userNamespaces
cache is dependent on both the clusters and the clustersNamespaces
caches, it is simpler to clear everything together. (I can change this if
wanted, no big deal.)
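
To make that concrete, here is a minimal sketch of the update path, with illustrative names rather than the real factory types:

```go
package cache

import "sync"

type clearable interface{ Clear() }

// factory shows only the pieces involved in the hash check.
type factory struct {
	mu                 sync.Mutex
	savedHash          string
	clusters           interface{ GetHash() string } // hash of the ordered cluster names
	clustersNamespaces clearable
	userNamespaces     clearable
}

// updateNamespaces runs on the periodic update: if the cluster set
// changed since last time, clear both dependent caches before refreshing.
func (f *factory) updateNamespaces() {
	f.mu.Lock()
	defer f.mu.Unlock()

	if h := f.clusters.GetHash(); h != f.savedHash {
		f.clustersNamespaces.Clear()
		f.userNamespaces.Clear() // depends on the other two caches, so clear together
		f.savedHash = h
	}
	// ...then refresh clustersNamespaces for the current clusters...
}
```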

The userNsList func has been pulled apart for testing purposes, but is
otherwise the same. Testing that the userNamespaces cache has been
cleared was harder since, as I said above, the GetAll does not really
get all, but rather "gets all based on this cluster list". I can expose new
methods and change this into a genuine List if that is preferred, but
have not done that for now.

Closes #2083

@Callisto13 Callisto13 added the bug Something isn't working label May 4, 2022
@Callisto13 Callisto13 requested review from foot and luizbafilho May 4, 2022 09:47
@luizbafilho (Contributor)

This looks good, but it's not enough, since we need to clean up the cf.usersNamespaces too. But unlike the clustersNamespaces, we cannot do that every 30 seconds, otherwise we would have to ask for the user's permissions on the cluster too frequently, slowing things down.

So I propose that we create a hash for the list of clusters, and only reset everything if that hash changes. So basically you'd do:

  • List the cluster names ordered and save a concatenated string in the factory; we could add a GetHash to the Clusters cache, for instance.
  • When updating the namespaces, check whether the current hash differs from the saved one; if so, clear the clustersNamespaces and the usersNamespaces (see the sketch below), if not, leave them as is.

With that, we should expect things to be eventually consistent, after 30 seconds or so.
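
For illustration, a GetHash on the Clusters cache could look something like this (a sketch only; the field names are assumptions):

```go
package cache

import (
	"sort"
	"strings"
	"sync"
)

// Clusters is a stand-in for the clusters cache; only what GetHash needs is shown.
type Clusters struct {
	mu       sync.RWMutex
	clusters map[string]struct{} // keyed by cluster name
}

// GetHash lists the cluster names in order and returns them as one
// concatenated string; callers only compare it against the value they
// saved on the previous update, so the exact encoding doesn't matter.
func (c *Clusters) GetHash() string {
	c.mu.RLock()
	defer c.mu.RUnlock()

	names := make([]string, 0, len(c.clusters))
	for name := range c.clusters {
		names = append(names, name)
	}
	sort.Strings(names)

	return strings.Join(names, "-")
}
```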

@foot (Contributor) commented May 4, 2022

I get some very strange behaviour…

  1. https://localhost:8000/v1/helmreleases? returns a "cannot find cluster=" (empty) kind of thing (sorry, I'll try to get the correct error message again)
  2. https://localhost:8000/v1/helmreleases?clusterName=Default complains it cannot find a secret that only exists in the leaf (ewq2)
  3. https://localhost:8000/v1/helmreleases?clusterName=ewq2 complains it cannot find a secret in mgmt (Default)

@Callisto13 (Contributor, Author)

> we need to clean up the cf.usersNamespaces too

I had deliberately left that to handle separately, but you are right that they have to be considered together 👍

@luizbafilho (Contributor)

> I get some very strange behaviour…
>
>   1. https://localhost:8000/v1/helmreleases? returns a "cannot find cluster=" (empty) kind of thing
>   2. https://localhost:8000/v1/helmreleases?clusterName=Default complains it cannot find a secret that only exists in the leaf (ewq2)
>   3. https://localhost:8000/v1/helmreleases?clusterName=ewq2 complains it cannot find a secret in mgmt (Default)

This is a separate issue, related to that handler specifically; looking at the code, it does indeed try to do some secret querying.

I also noticed you are trying to query by ClusterName; I don't think all list handlers support that. Feel free to open another issue to add that too.

@Callisto13 Callisto13 force-pushed the clear-cluster-ns-cache branch from 01d7772 to 8e071b5 Compare May 5, 2022 13:30
@Callisto13 Callisto13 changed the title from "fix: Clear the clustersNamespaces cache on Update" to "fix: Clear related caches on clusters update" May 5, 2022
@Callisto13 (Contributor, Author)

Updated commit and description

@Callisto13 Callisto13 force-pushed the clear-cluster-ns-cache branch from 8e071b5 to dafb4bf Compare May 5, 2022 14:11
@luizbafilho (Contributor) left a comment


This looks great besides a couple of small nits.

@foot would it be possible for you to test it from this branch? Or does it need to be merged?

@foot (Contributor) commented May 5, 2022 via email

@foot (Contributor) commented May 6, 2022

Still not quite working unfortunately.

/v1/flux_runtime_objects returns:

{
  "code": 2,
  "message": "could not list deployments in namespace flux-system: cluster=ewq2 not found",
  "details": []
}

We could add some debug logging? 🤔 I can put together a cluster to play with if it helps too.

@Callisto13 Callisto13 force-pushed the clear-cluster-ns-cache branch from dafb4bf to 3b47eaa Compare May 6, 2022 09:05
@Callisto13 (Contributor, Author)

@foot we could jump on a call together, might be faster?

I only have a couple of hours today before I go on holiday, so someone may need to take this off me if it drags on.

@foot (Contributor) commented May 6, 2022

After some testing, and being more careful about observing the 30s window etc., it seems to be working great! 💯

LGTM ⭐

@Callisto13 Callisto13 merged commit f4e42a0 into main May 6, 2022
@Callisto13 Callisto13 deleted the clear-cluster-ns-cache branch May 6, 2022 10:57
foot added a commit to weaveworks/weave-gitops-enterprise that referenced this pull request May 11, 2022
Includes important MC fixes:
- weaveworks/weave-gitops#2137
- weaveworks/weave-gitops#2085

Fixes needed to address core changes:
- `wego-admin` is the new user to log in as (no longer `admin`)
- Core can now be configured with a fake-client so we can remove some hacks
- Adds a little bit of debugging around MC querying
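
For context, "fake-client" here refers to the standard controller-runtime fake client; a generic sketch of seeding one in a test (not the exact Core wiring) might look like:

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	clientgoscheme "k8s.io/client-go/kubernetes/scheme"
	"sigs.k8s.io/controller-runtime/pkg/client/fake"
)

func main() {
	scheme := runtime.NewScheme()
	_ = clientgoscheme.AddToScheme(scheme)

	// Seed the fake client with the objects the code under test expects to find.
	c := fake.NewClientBuilder().
		WithScheme(scheme).
		WithObjects(&corev1.Namespace{
			ObjectMeta: metav1.ObjectMeta{Name: "flux-system"},
		}).
		Build()

	var namespaces corev1.NamespaceList
	_ = c.List(context.Background(), &namespaces)
	fmt.Println(len(namespaces.Items)) // 1
}
```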

Successfully merging this pull request may close these issues:

flux-runtime endpoint tries to list deployments in a cluster after the cluster is removed