Conversation

@lepsa (Contributor) commented Aug 7, 2023

https://wearezeta.atlassian.net/browse/WPB-3668

Fixing how errors are caught, mapped, and logged.

  • Removed redundant errors from deleteFederationDomainRemoteUserFromLocalConversations.
  • Removed redundant constraints from several functions.
  • Removed commented-out code that isn't needed.
  • Updated deleteFederationDomainOneOnOne to use a recovering policy so that it keeps retrying when Brig returns an error over HTTP.

Checklist

  • [:heavy_check_mark:] Add a new entry in an appropriate subdirectory of changelog.d
  • [:heavy_check_mark:] Read and follow the PR guidelines

@lepsa changed the title from "DRAFT: Wpb 3631 fix defederate loop" to "WPB 3631 fix defederate loop" Aug 7, 2023
@lepsa lepsa marked this pull request as ready for review August 7, 2023 09:14
@fisx fisx mentioned this pull request Aug 7, 2023
@fisx changed the title from "WPB 3631 fix defederate loop" to "Fix defederate loop" Aug 7, 2023
resolved conflicts:
        services/federator/src/Federator/Monitor.hs
        services/galley/src/Galley/API/Internal.hs

plus lots of non-trivial changes to the local PR (sorry!)
@fisx fisx changed the base branch from WPB-3631-fix-defederate-loop to develop August 7, 2023 14:59
-- similar processing run for removing the local domain from their federation list.
onConversationUpdated dom convUpdate
logAndIgnoreErrors @NoChanges
(const "No Changes: Could not remove a local member from a remote conversation.")
Contributor:

why would you throw away the error information?

Contributor Author:

NoChanges doesn't hold any information itself; it's a zero-arity constructor. We could use it in a Show context, but that would make the error message a bit less readable.

The same idea holds for NotATeamMember. Only two of the GalleyError constructors hold additional information, and we aren't catching either of them in this code.
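For illustration, here is a minimal standalone sketch of what a helper like logAndIgnoreErrors could look like. This is not the PR's actual implementation (which runs in galley's effect stack); the types and names here are assumptions for a plain-IO version.

```haskell
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TypeApplications #-}

import Control.Exception (Exception, throwIO, try)

-- A zero-arity error type: `show`-ing it adds nothing useful to a log line.
data NoChanges = NoChanges deriving (Show)

instance Exception NoChanges

-- Run an action; if it throws the chosen exception type, log a rendered
-- message and swallow the error instead of propagating it.
logAndIgnoreErrors ::
  forall e a.
  Exception e =>
  (e -> String) ->
  IO a ->
  IO (Maybe a)
logAndIgnoreErrors render act =
  try @e act
    >>= either (\e -> putStrLn (render e) >> pure Nothing) (pure . Just)

main :: IO ()
main = do
  -- For a zero-arity error, a constant message reads better than `show e`.
  r <-
    logAndIgnoreErrors @NoChanges
      (const "No Changes: could not remove a local member from a remote conversation.")
      (throwIO NoChanges)
  print (r :: Maybe ())
```

For the two GalleyError constructors that do carry fields, the `e -> String` argument would render them instead of ignoring them, which is why the helper takes a function rather than a fixed message.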

Contributor:

sorry, i meant to remove that question. all good!

liftIO $ throwIO e
)
pure
-- This is the same policy as background-worker for retrying.
Contributor:

then why do it here as well? and what happens after 60 seconds?

Contributor Author:

Doing it here in addition to background-worker helps speed things up, specifically when Brig returns an error as we try to delete the connections. Before, that error would be thrown all the way up to servant, where it would make the caller retry the whole galley call. That means galley would have to redo all the DB work of looking up the remote domain. It isn't much of a cost, but it is a cost nevertheless. This basically short-circuits that retry logic here to keep things running a bit faster. We still want the recovering logic in background-worker in case the network drops or something else goes wrong in the code.

I copied the policy from background-worker because it seemed reasonable, and I didn't want to experiment with new values while trying to cover a stricter subset of errors.
After 60 seconds, the retry delay stays at a constant 60 seconds.
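As a sketch of what such a policy looks like with the retry package (the actual constants and handler set in the PR may differ; retrying on HttpException is an assumption here):

```haskell
{-# LANGUAGE NumericUnderscores #-}
{-# LANGUAGE ScopedTypeVariables #-}

import Control.Monad.Catch (Handler (..))
import Control.Retry (RetryPolicyM, capDelay, exponentialBackoff, recovering)
import Network.HTTP.Client (HttpException)

-- Exponential backoff starting at 1s; once the computed delay would exceed
-- 60s it is capped, so further retries happen at a constant 60s interval.
policy :: RetryPolicyM IO
policy = capDelay 60_000_000 (exponentialBackoff 1_000_000) -- microseconds

-- Retry the Brig call only on HTTP errors; anything else propagates.
deleteWithRetries :: IO () -> IO ()
deleteWithRetries deleteConnections =
  recovering
    policy
    [\_status -> Handler (\(_ :: HttpException) -> pure True)]
    (\_status -> deleteConnections)
```

Keeping the handler list narrow is what makes this a "stricter subset of errors" than background-worker's catch-all retry: non-HTTP failures still bubble up and hit the outer recovery.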

-- `NoChanges` doesn't contain too many details, so no point in showing it here.
)
"Federation domain removal"
catchBaddies $ do
Contributor:

mostly to minimize whitespace changes, but i also like the name. :)

Contributor Author:

Both are nice reasons! The name is short and to the point.

. P.logAndIgnoreErrors @NoChanges (const "No changes") msgText

mapAllErrors "Federation domain removal" $ do
getConversation cnvId
Contributor:

more straightforward than extracting cnvId from lConv again.

@fisx fisx added the ok-to-test Approved for running tests in CI, overrides not-ok-to-test if both labels exist label Aug 7, 2023
@fisx (Contributor) commented Aug 7, 2023

oh, and i may have removed some comments that i didn't understand, @lepsa could you double-check?

@fisx (Contributor) commented Aug 7, 2023

another thing: is there a way to exponentially back off rabbitMQ directly? this would be the most robust way of making sure these crash loops are less aggressive, while still maintaining the same response times in almost all cases.

@lepsa (Contributor, Author) commented Aug 8, 2023

> another thing: is there a way to exponentially back off rabbitMQ directly? this would be the most robust way of making sure these crash loops are less aggressive, while still maintaining the same response times in almost all cases.

We can back it off using dead-letter queues and timeouts, but we will have to change some of the semantics around how we NACK messages we can't parse.

Currently we NACK with redelivery for messages we can parse but failed to handle correctly; these are put back into the main queue as close as possible to their original position (right at the front, in our setup). We NACK without redelivery for messages we can't parse, i.e. malformed domains and JSON.

We would need to change this to ACKing malformed messages and ignoring them in application logic, and NACKing without the redelivery flag for all messages we want to requeue with a delay. These are easy changes, but we will have to put plenty of comments around this code and in the README so we don't catch ourselves in the "backwards" logic of it in the future.

The basic idea for backing off messages is to have two queues, as explained here: https://medium.com/@dotbox/delayed-requeuing-with-rabbitmq-dcbdf0026bf0

TL;DR: have two queues, a main one you pull from and a dead-letter queue. The dead-letter queue has a message TTL and its own dead-lettering set up to push expired messages back to the main queue.

When a message is NACKed by the consumer, it is put into the dead-letter queue. After the TTL expires, RabbitMQ puts it back into the main queue, where it can be delivered again. Since RabbitMQ supports multiple queues dead-lettering to the same queue, you could probably do something with message headers and exchanges to get pseudo-exponential backoff from a set of wait queues with increasing TTLs, e.g. 10s, 20s, 40s, ...
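A rough sketch of that two-queue setup with the amqp package. The queue names, the 10s TTL, and the routing via the default exchange are all made up for illustration, and the exact FieldValue constructors can vary between amqp package versions:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Map as Map
import Network.AMQP

-- Main queue: messages NACKed with requeue=False are dead-lettered to "wait".
-- Wait queue: has no consumers; once the 10s TTL expires, each message is
-- dead-lettered back to "main", where it is delivered again.
setupDelayedRequeue :: Channel -> IO ()
setupDelayedRequeue chan = do
  _ <-
    declareQueue chan newQueue
      { queueName = "main",
        queueHeaders =
          FieldTable
            (Map.fromList
              [ ("x-dead-letter-exchange", FVString ""),
                ("x-dead-letter-routing-key", FVString "wait")
              ])
      }
  _ <-
    declareQueue chan newQueue
      { queueName = "wait",
        queueHeaders =
          FieldTable
            (Map.fromList
              [ ("x-message-ttl", FVInt32 10000),
                ("x-dead-letter-exchange", FVString ""),
                ("x-dead-letter-routing-key", FVString "main")
              ])
      }
  pure ()
```

Pseudo-exponential backoff would add more wait queues with larger TTLs (hypothetically "wait20", "wait40", ...) and pick the dead-letter routing key based on a retry-count header on the message.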

@lepsa (Contributor, Author) commented Aug 8, 2023

Changes look good to me

@fisx fisx merged commit ba6ca9f into wireapp:develop Aug 8, 2023
@lepsa lepsa deleted the WPB-3631-fix-defederate-loop branch August 10, 2023 04:49