Conversation

@lepsa (Contributor) commented Aug 7, 2023

https://wearezeta.atlassian.net/browse/WPB-3668

Fixing how errors are caught, mapped, and logged.

  • Removed redundant errors from deleteFederationDomainRemoteUserFromLocalConversations.
  • Removed redundant constraints from several functions.
  • Removed commented-out code that isn't needed.
  • Updated deleteFederationDomainOneOnOne to use a recovering policy so that it keeps retrying when Brig returns an error over HTTP.

Checklist

  • [:heavy_check_mark:] Add a new entry in an appropriate subdirectory of changelog.d
  • [:heavy_check_mark:] Read and follow the PR guidelines

@lepsa changed the title from "DRAFT: Wpb 3631 fix defederate loop" to "WPB 3631 fix defederate loop" Aug 7, 2023
@lepsa lepsa marked this pull request as ready for review August 7, 2023 09:14
@fisx fisx mentioned this pull request Aug 7, 2023
@fisx changed the title from "WPB 3631 fix defederate loop" to "Fix defederate loop" Aug 7, 2023
resolved conflicts:
        services/federator/src/Federator/Monitor.hs
        services/galley/src/Galley/API/Internal.hs

plus lots of non-trivial changes to the local PR (sorry!)
@fisx fisx changed the base branch from WPB-3631-fix-defederate-loop to develop August 7, 2023 14:59
-- similar processing run for removing the local domain from their federation list.
onConversationUpdated dom convUpdate
logAndIgnoreErrors @NoChanges
(const "No Changes: Could not remove a local member from a remote conversation.")
Contributor:

why would you throw away the error information?

Contributor Author:

NoChanges doesn't hold any information itself; it's a zero-arity constructor. We could use it in a Show context, but that would make the error message a bit less readable.

The same idea holds for NotATeamMember. Only two of the GalleyError constructors hold additional information, and we aren't catching either of them in this code.
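For illustration, here is a minimal standalone sketch of what a helper like logAndIgnoreErrors could look like. This is not the PR's actual implementation (which runs in galley's effect stack); the types and names here are assumptions for a plain-IO version.

```haskell
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TypeApplications #-}

import Control.Exception (Exception, throwIO, try)

-- A zero-arity error type: `show`-ing it adds nothing useful to a log line.
data NoChanges = NoChanges deriving (Show)

instance Exception NoChanges

-- Run an action; if it throws the chosen exception type, log a rendered
-- message and swallow the error instead of propagating it.
logAndIgnoreErrors ::
  forall e a.
  Exception e =>
  (e -> String) ->
  IO a ->
  IO (Maybe a)
logAndIgnoreErrors render act =
  try @e act
    >>= either (\e -> putStrLn (render e) >> pure Nothing) (pure . Just)

main :: IO ()
main = do
  -- For a zero-arity error, a constant message reads better than `show e`.
  r <-
    logAndIgnoreErrors @NoChanges
      (const "No Changes: could not remove a local member from a remote conversation.")
      (throwIO NoChanges)
  print (r :: Maybe ())
```

For the two GalleyError constructors that do carry fields, the `e -> String` argument would render them instead of ignoring them, which is why the helper takes a function rather than a fixed message.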

Contributor:

sorry, i meant to remove that question. all good!

liftIO $ throwIO e
)
pure
-- This is the same policy as background-worker for retrying.
Contributor:

then why do it here as well? and what happens after 60 seconds?

Contributor Author:

Doing it here in addition to background-worker helps speed things up, specifically when Brig returns an error as we try to delete the connections. Before, that error would be thrown all the way up to servant, where it would make the caller retry the whole galley call. That means galley would have to redo all the DB work of looking up the remote domain. It isn't much of a cost, but it is a cost nevertheless. This basically short-circuits that retry logic here to keep things running a bit faster. We still want the recovering logic in background-worker in case the network drops or something else goes wrong in the code.

I copied the policy from background-worker because it seemed reasonable, and I didn't want to experiment with new values while trying to cover a stricter subset of errors.
After 60 seconds, the retry delay stays at a constant 60 seconds.
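As a sketch of what such a policy looks like with the retry package (the actual constants and handler set in the PR may differ; retrying on HttpException is an assumption here):

```haskell
{-# LANGUAGE NumericUnderscores #-}
{-# LANGUAGE ScopedTypeVariables #-}

import Control.Monad.Catch (Handler (..))
import Control.Retry (RetryPolicyM, capDelay, exponentialBackoff, recovering)
import Network.HTTP.Client (HttpException)

-- Exponential backoff starting at 1s; once the computed delay would exceed
-- 60s it is capped, so further retries happen at a constant 60s interval.
policy :: RetryPolicyM IO
policy = capDelay 60_000_000 (exponentialBackoff 1_000_000) -- microseconds

-- Retry the Brig call only on HTTP errors; anything else propagates.
deleteWithRetries :: IO () -> IO ()
deleteWithRetries deleteConnections =
  recovering
    policy
    [\_status -> Handler (\(_ :: HttpException) -> pure True)]
    (\_status -> deleteConnections)
```

Keeping the handler list narrow is what makes this a "stricter subset of errors" than background-worker's catch-all retry: non-HTTP failures still bubble up and hit the outer recovery.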

-- `NoChanges` doesn't contain too many details, so no point in showing it here.
)
"Federation domain removal"
catchBaddies $ do
Contributor:

mostly to minimize whitespace changes, but i also like the name. :)

Contributor Author:

Both are nice reasons! The name is short and to the point.

. P.logAndIgnoreErrors @NoChanges (const "No changes") msgText

mapAllErrors "Federation domain removal" $ do
getConversation cnvId
Contributor:

more straightforward than extracting cnvId from lConv again.

@fisx fisx added the ok-to-test Approved for running tests in CI, overrides not-ok-to-test if both labels exist label Aug 7, 2023
@fisx (Contributor) commented Aug 7, 2023

oh, and i may have removed some comments that i didn't understand, @lepsa could you double-check?

@fisx (Contributor) commented Aug 7, 2023

another thing: is there a way to exponentially back off rabbitMQ directly? this would be the most robust way of making sure these crash loops are less aggressive, while still maintaining the same response times in almost all cases.

@lepsa (Contributor, Author) commented Aug 8, 2023

> another thing: is there a way to exponentially back off rabbitMQ directly? this would be the most robust way of making sure these crash loops are less aggressive, while still maintaining the same response times in almost all cases.

We can back it off using dead-letter queues and timeouts, but we will have to change some of the semantics around how we NACK messages we can't parse.

Currently we NACK with redelivery for messages we can parse but failed to handle correctly; these are put back into the main queue as close as possible to their original position (right at the front, in our setup). We NACK without redelivery for messages we can't parse, i.e. malformed domains and JSON.

We would need to change this to ACKing malformed messages and ignoring them in application logic, and NACKing without the redelivery flag for all messages we want to requeue with a delay. These are easy changes, but we will have to put plenty of comments around this code and in the README so we don't catch ourselves in the "backwards" logic of it in the future.

The basic idea for backing off messages is to have two queues, as explained here: https://medium.com/@dotbox/delayed-requeuing-with-rabbitmq-dcbdf0026bf0

TL;DR: have two queues, a main one you pull from and a dead-letter queue. The dead-letter queue has a message TTL and its own dead-lettering set up to push expired messages back to the main queue.

When a message is NACKed by the consumer, it is put into the dead-letter queue. After the TTL expires, RabbitMQ puts it back into the main queue, where it can be delivered again. Since RabbitMQ supports multiple queues dead-lettering to the same queue, you could probably do something with message headers and exchanges to get pseudo-exponential backoff from a set of wait queues with increasing TTLs, e.g. 10s, 20s, 40s, ...
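A rough sketch of that two-queue setup with the amqp package. The queue names, the 10s TTL, and the routing via the default exchange are all made up for illustration, and the exact FieldValue constructors can vary between amqp package versions:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Map as Map
import Network.AMQP

-- Main queue: messages NACKed with requeue=False are dead-lettered to "wait".
-- Wait queue: has no consumers; once the 10s TTL expires, each message is
-- dead-lettered back to "main", where it is delivered again.
setupDelayedRequeue :: Channel -> IO ()
setupDelayedRequeue chan = do
  _ <-
    declareQueue chan newQueue
      { queueName = "main",
        queueHeaders =
          FieldTable
            (Map.fromList
              [ ("x-dead-letter-exchange", FVString ""),
                ("x-dead-letter-routing-key", FVString "wait")
              ])
      }
  _ <-
    declareQueue chan newQueue
      { queueName = "wait",
        queueHeaders =
          FieldTable
            (Map.fromList
              [ ("x-message-ttl", FVInt32 10000),
                ("x-dead-letter-exchange", FVString ""),
                ("x-dead-letter-routing-key", FVString "main")
              ])
      }
  pure ()
```

Pseudo-exponential backoff would add more wait queues with larger TTLs (hypothetically "wait20", "wait40", ...) and pick the dead-letter routing key based on a retry-count header on the message.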

@lepsa (Contributor, Author) commented Aug 8, 2023

Changes look good to me

@fisx fisx merged commit ba6ca9f into wireapp:develop Aug 8, 2023
@lepsa lepsa deleted the WPB-3631-fix-defederate-loop branch August 10, 2023 04:49