Fix defederate loop #3481
Conversation
Resolved conflicts in services/federator/src/Federator/Monitor.hs and services/galley/src/Galley/API/Internal.hs, plus lots of non-trivial changes to the local PR (sorry!)
-- similar processing run for removing the local domain from their federation list.
onConversationUpdated dom convUpdate
logAndIgnoreErrors @NoChanges
  (const "No Changes: Could not remove a local member from a remote conversation.")
why would you throw away the error information?
NoChanges doesn't hold any information itself; it's a zero-arity constructor. We could use it in a Show context, but that would make the error message a bit less nice to read. The same idea holds for NotATeamMember. Only 2 of the GalleyError constructors hold additional information, and we aren't catching either of those in this code.
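A simplified sketch of the shape being described (the constructor carrying data is a hypothetical stand-in, not an actual wire-server definition):

```haskell
-- Hypothetical, pared-down illustration: most constructors are nullary,
-- so `show` yields nothing beyond the constructor name.
data GalleyError
  = NoChanges          -- nullary: `show NoChanges` is just "NoChanges"
  | NotATeamMember     -- nullary as well
  | SomeRichError Int  -- hypothetical example of a constructor carrying data
  deriving (Show)

-- A hand-written message reads better than the derived Show output:
message :: GalleyError -> String
message NoChanges = "No Changes: Could not remove a local member from a remote conversation."
message e = show e
```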
sorry, i meant to remove that question. all good!
liftIO $ throwIO e
)
pure
-- This is the same policy as background-worker for retrying.
then why do it here as well? and what happens after 60 seconds?
Doing it here in addition to background-worker helps speed things up, specifically when Brig returns an error while we try to delete the connections. Before, this error would be thrown all the way up to servant, where it would make Brig retry the galley call; galley would then have to redo all the DB work of looking up the remote domain. That isn't much of a cost, but it is a cost nevertheless. This basically short-circuits that retry logic here to keep things running a bit faster. We still want the recovering logic in background-worker in case the network drops or something else goes wrong in the code.
I copied the policy from background-worker because it seemed reasonable and I didn't want to experiment with new values when I'm trying to cover a stricter subset of errors.
Once the delay reaches 60 seconds, it stays constant at 60 seconds.
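A minimal sketch of such a capped policy using the retry package (Control.Retry); the 1-second base delay and the decision to retry only HTTP exceptions are assumptions for illustration, not the actual wire-server values:

```haskell
{-# LANGUAGE ScopedTypeVariables #-}

import Control.Monad.Catch (Handler (..))
import Control.Retry (RetryPolicyM, capDelay, exponentialBackoff, recovering)
import Network.HTTP.Client (HttpException)

-- Exponential backoff that plateaus: 1s, 2s, 4s, ... capped at a constant 60s.
retryPolicy :: RetryPolicyM IO
retryPolicy = capDelay 60000000 (exponentialBackoff 1000000) -- microseconds

-- Keep re-running the action for as long as it throws an HTTP exception.
withRetries :: IO a -> IO a
withRetries action =
  recovering
    retryPolicy
    [const $ Handler (\(_ :: HttpException) -> pure True)] -- retry only HTTP errors
    (const action)
```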
-- `NoChanges` doesn't contain too many details, so no point in showing it here.
)
"Federation domain removal"
catchBaddies $ do
mostly to minimize whitespace changes, but i also like the name. :)
Both are nice reasons! The name is short and to the point.
. P.logAndIgnoreErrors @NoChanges (const "No changes") msgText
mapAllErrors "Federation domain removal" $ do
getConversation cnvId
more straightforward than extracting cnvId from lConv again.
oh, and i may have removed some comments that i didn't understand, @lepsa could you double-check?
another thing: is there a way to exponentially back off rabbitMQ deliveries directly? this would be the most robust way of making sure these crash loops are less aggressive, while still maintaining the same response times in almost all cases.
We can back it off using dead-letter queues and timeouts, but we would have to change some of the semantics around how we NACK messages we can't parse. Currently we NACK with redelivery for messages we can parse but failed to handle correctly; these are put back into the main queue as close as possible to their original position (right at the front, for our setup). We NACK without redelivery for messages we can't parse, i.e. malformed domains and JSON.

We would need to change this to ACKing malformed messages and ignoring them in application logic, and NACKing without the redelivery flag for all messages we want to requeue with a delay. These are easy changes, but we would have to put in a lot of comments around this and in the README so we don't catch ourselves out on the "backwards" logic of it in the future.

The basic idea for backing off messages is to have two queues, as explained here: https://medium.com/@dotbox/delayed-requeuing-with-rabbitmq-dcbdf0026bf0. TL;DR: have a main queue you pull from and a dead-letter queue. The dead-letter queue has a TTL timeout and has its own dead-lettering set to push messages back to the main queue. When a message is NACKed by the consumer, it is put into the dead-letter queue; after the timeout, Rabbit puts it back into the main queue, where it can be delivered again. Since Rabbit supports multiple queues dead-lettering to the same queue, I'm sure you could do something with the message headers and exchanges to get pseudo-exponential backoff by having a set of queues, e.g. a 10s timeout, a 20s timeout, 40s, ... A sketch of the two-queue setup follows.
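A minimal sketch of the two-queue delayed-requeue setup described above, using the Haskell amqp package; the queue names and the 10-second TTL are assumptions for illustration, while the x-arguments are standard RabbitMQ dead-lettering settings:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Map as Map
import Network.AMQP
import Network.AMQP.Types (FieldTable (..), FieldValue (..))

-- Declare a main queue plus a retry (dead-letter) queue. NACKed messages
-- from the main queue land in the retry queue; after the TTL expires they
-- are dead-lettered back onto the main queue for redelivery.
setupDelayedRequeue :: Channel -> IO ()
setupDelayedRequeue chan = do
  -- Main queue: dead-letter rejected messages to the retry queue
  -- (via the default exchange, routed by queue name).
  _ <- declareQueue chan newQueue
    { queueName = "defederation"  -- assumed name
    , queueHeaders = FieldTable $ Map.fromList
        [ ("x-dead-letter-exchange", FVString "")
        , ("x-dead-letter-routing-key", FVString "defederation-retry")
        ]
    }
  -- Retry queue: no consumers; messages sit here for the TTL, then are
  -- dead-lettered back to the main queue.
  _ <- declareQueue chan newQueue
    { queueName = "defederation-retry"  -- assumed name
    , queueHeaders = FieldTable $ Map.fromList
        [ ("x-message-ttl", FVInt32 10000)  -- 10s delay, assumed
        , ("x-dead-letter-exchange", FVString "")
        , ("x-dead-letter-routing-key", FVString "defederation")
        ]
    }
  pure ()

-- With this topology the consumer rejects without requeue, which sends the
-- message to the retry queue instead of dropping it:
--   rejectEnv env False
```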
Changes look good to me |
https://wearezeta.atlassian.net/browse/WPB-3668

- Fixing how errors are caught, mapped, and logged.
- Removed redundant errors from deleteFederationDomainRemoteUserFromLocalConversations.
- Removing redundant constraints from several functions.
- Removing commented code that isn't needed.
- Updating deleteFederationDomainOneOnOne to have a recovering policy so it keeps retrying when there is an error returned over HTTP from Brig.

Checklist
- changelog.d