WPB-2986: Background-worker graceful termination #3421
Conversation
Reworking how and where cleanup code is called so that we aren't triggering a different set of exceptions about trying to send data over closed network pipes.
Reworked where exceptions are caught and added per-consumer MVars that signal when a consumer is processing a message. This allows us to wait for consumers to finish processing their current message before we close the RabbitMQ connection. Added scripts to fill RabbitMQ with messages and to simulate Kubernetes pod shutdown signalling.
I don't understand why we have one MVar to tell the thread to stop and the same one to mean that the thread has stopped. I might be interpreting this wrong, so please explain. I am not sure why we don't use two MVars: one to signal to the consumer thread to stop retrying, and a second as a signal from the thread that it has stopped. There can only be one thread running per domain at any given time anyway.
The `MVar` is just called `mvar` everywhere. Can it please have a better variable name?
Co-authored-by: Akshay Mankar <[email protected]>
The single mvar is primarily used to signal to the cleanup code when the thread has finished processing a message from RabbitMQ, so that we don't close AMQP connections and channels before threads have finished with them. That the mvar can block the thread from sending out another notification isn't the goal, but it is a nice side benefit.
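For context, a minimal sketch of that "busy flag" idea, assuming one MVar per consumer: the consumer holds the MVar while it handles a message, and the cleanup path takes the same MVar before closing anything, so it blocks until the in-flight message is done. Names such as runningFlag, processMessage, and closeAmqp are placeholders, not this PR's exact identifiers.

import Control.Concurrent.MVar (MVar, newMVar, putMVar, takeMVar, withMVar)
import Control.Exception (bracket_)

-- The consumer holds the MVar while it handles a single message.
consumeOne :: MVar () -> IO () -> IO ()
consumeOne runningFlag processMessage =
  bracket_ (takeMVar runningFlag) (putMVar runningFlag ()) processMessage

-- The cleanup path takes the same MVar before closing AMQP resources,
-- so it blocks until any in-flight message has been handled.
shutdownConsumer :: MVar () -> IO () -> IO ()
shutdownConsumer runningFlag closeAmqp =
  withMVar runningFlag (\() -> closeAmqp)

main :: IO ()
main = do
  runningFlag <- newMVar ()
  consumeOne runningFlag (putStrLn "handling one message")
  shutdownConsumer runningFlag (putStrLn "closing channel and connection")

Using bracket_ also guarantees the flag is put back even if message handling throws, so a failed delivery cannot deadlock the shutdown path.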
Adding a config option for the shutdown grace period so that the pod can limit how long it is going to allow requests to retry after it has received a shutdown signal from Kubernetes.
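A hedged sketch of how such a setting can feed the retry policy, assuming shutdownGraceTime is a Natural number of seconds from the config; the record layout here is illustrative, and only the field name and the arithmetic mirror the ormolu output below.

{-# LANGUAGE NumericUnderscores #-}

import Control.Monad.IO.Class (MonadIO)
import Control.Retry (RetryPolicyM, fullJitterBackoff, limitRetriesByCumulativeDelay)
import Numeric.Natural (Natural)

newtype Opts = Opts
  { shutdownGraceTime :: Natural -- seconds the pod may keep retrying after SIGTERM
  }

-- Spend only two thirds of the grace period on retries, leaving headroom for
-- the rest of the shutdown before Kubernetes escalates to SIGKILL.
retryPolicy :: MonadIO m => Opts -> RetryPolicyM m
retryPolicy opts =
  let delayUsablePercentage = 2 / 3 :: Float
      capMicros = floor (delayUsablePercentage * fromIntegral (shutdownGraceTime opts) * 1_000_000)
   in limitRetriesByCumulativeDelay capMicros (fullJitterBackoff 10_000)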
Ormolu is not happy:
integration/test/Testlib/Types.hs... ok
libs/brig-types/src/Brig/Types/User.hs... ok
libs/gundeck-types/src/Gundeck/Types/Push/V2.hs... ok
libs/types-common-aws/src/Util/Test/SQS.hs... ok
libs/wire-api/test/unit/Test/Wire/API/User/Search.hs... ok
services/background-worker/exec/Main.hs... ok
services/background-worker/src/Wire/BackendNotificationPusher.hs
@@ -57,7 +57,7 @@
--
-- If we fail to deliver the notification after policy, the notification will be NACKed,
-- and will be redelivered by RabbitMQ for another attempt, most likely by the same pod.
- let delayUsablePercentage = 2/3 :: Float
+ let delayUsablePercentage = 2 / 3 :: Float
policy = limitRetriesByCumulativeDelay (floor $ delayUsablePercentage * fromIntegral env.shutdownGraceTime * 1_000_000) $ fullJitterBackoff 10000
logErrr willRetry (SomeException e) rs = do
Log.err $
services/background-worker/src/Wire/BackendNotificationPusher.hs... *** FAILED
services/background-worker/src/Wire/BackgroundWorker.hs... ok
services/background-worker/src/Wire/BackgroundWorker/Env.hs
@@ -12,6 +12,7 @@
import Imports
import Network.AMQP.Extended
import qualified Network.RabbitMqAdmin as RabbitMqAdmin
+ import Numeric.Natural
import OpenSSL.Session (SSLOption (..))
import qualified OpenSSL.Session as SSL
import Prometheus
@@ -21,6 +22,5 @@
import qualified System.Logger.Extended as Log
import Util.Options
import Wire.BackgroundWorker.Options
- import Numeric.Natural
type IsWorking = Bool
services/background-worker/src/Wire/BackgroundWorker/Env.hs... *** FAILED
services/background-worker/src/Wire/BackgroundWorker/Health.hs... ok
services/background-worker/src/Wire/BackgroundWorker/Options.hs
@@ -3,9 +3,9 @@
import Data.Aeson
import Imports
import Network.AMQP.Extended
+ import Numeric.Natural
import System.Logger.Extended
import Util.Options
- import Numeric.Natural
data Opts = Opts
{ logLevel :: !Level,
services/background-worker/src/Wire/BackgroundWorker/Options.hs... *** FAILED
services/proxy/src/Proxy/Run.hs... ok
tools/stern/test/unit/Main.hs... ok
ormolu failed on 3 files.
you can fix this by running 'make format' from the git repo root.
make: *** [Makefile:216: formatc] Error 1
Changing how AMQP cancellation is performed. This is because the connection recovery code uses a new thread when recovering connections, and these threads weren't tracked by the `async` call that was being used to signal cancellation. Now we're using MVars that are passed into callbacks to track channels and consumers, so that we can track what needs to be cleaned up even as threads die and are created.
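A hedged sketch of that tracking approach: the recovery callback may run on a thread we never spawned, so instead of cancelling an Async it registers the fresh channel and consumer tag in an MVar, and the shutdown path reads them back. The queue name and the no-op callback are placeholders; openChannel, consumeMsgs, cancelConsumer, and closeChannel are from the amqp package (Network.AMQP).

{-# LANGUAGE OverloadedStrings #-}

import Control.Concurrent.MVar (MVar, putMVar, tryTakeMVar)
import Network.AMQP
  ( Ack (Ack),
    Channel,
    Connection,
    ConsumerTag,
    cancelConsumer,
    closeChannel,
    consumeMsgs,
    openChannel,
  )

-- Called whenever a (re)connected Connection needs its channel set up;
-- whichever thread runs it, the live resources end up in the shared MVar.
setUpConsumer :: MVar (Channel, ConsumerTag) -> Connection -> IO ()
setUpConsumer ref conn = do
  chan <- openChannel conn
  tag <- consumeMsgs chan "backend-notifications.example.com" Ack (\_ -> pure ())
  _ <- tryTakeMVar ref -- drop any stale entry left by a previous, dead channel
  putMVar ref (chan, tag)

-- Shutdown closes whatever is currently registered, if anything.
tearDownConsumer :: MVar (Channel, ConsumerTag) -> IO ()
tearDownConsumer ref = do
  registered <- tryTakeMVar ref
  case registered of
    Nothing -> pure ()
    Just (chan, tag) -> do
      cancelConsumer chan tag
      closeChannel chan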
Looks good overall, minor nits.
Updating the defederation runner to use IORefs and MVars in the same way as the notification pusher thread, so that the channel and consumers are tracked and closed on SIGINT/SIGTERM.
Allowing the domain sync loop to detect when it is being cancelled by an async cancel call and propagate the exception up the stack, breaking the forever loop and allowing the rest of the code to carry on.
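A hedged sketch of what that looks like, assuming the loop catches SomeException around each iteration: cancellation is detected via fromException and rethrown, everything else is treated as transient. syncAllDomains is a placeholder for the real sync action.

import Control.Concurrent (threadDelay)
import Control.Concurrent.Async (AsyncCancelled (AsyncCancelled), async, cancel)
import Control.Exception (SomeException, fromException, throwIO, try)
import Control.Monad (forever)

domainSyncLoop :: IO () -> IO ()
domainSyncLoop syncAllDomains = forever $ do
  result <- try syncAllDomains :: IO (Either SomeException ())
  case result of
    Right () -> pure ()
    Left e
      -- Rethrow cancellation so `cancel` on the surrounding Async breaks
      -- out of `forever` instead of being swallowed by the catch-all.
      | Just AsyncCancelled <- fromException e -> throwIO e
      -- Anything else is treated as transient; real code would log it first.
      | otherwise -> pure ()
  threadDelay 1000000

main :: IO ()
main = do
  loop <- async (domainSyncLoop (pure ()))
  threadDelay 3000000
  cancel loop -- the loop rethrows AsyncCancelled and terminates cleanly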
bracket_ (takeMVar runningFlag) (putMVar runningFlag ()) $ do
  -- Non 2xx responses will throw an exception
  -- So we are relying on that to be caught by recovering
  resp <- liftIO $ httpLbs (req env d) manager
This seems like a perfect opportunity to use servant; why are we building requests by hand? Perhaps this is also for another PR.
I think that a separate PR would be best, as the other internal routes in galley should also be moved over to servant.
Adding graceful shutdown handling via POSIX signals and exception handling.
Kubernetes signals and timings come from "Kubernetes best practices: terminating with grace".
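A hedged sketch of the signal wiring, assuming SIGTERM and SIGINT simply fill an MVar that the main thread waits on before running cleanup; runCleanup is a placeholder for the worker's real cleanup (draining consumers, closing the AMQP connection).

import Control.Concurrent.MVar (newEmptyMVar, takeMVar, tryPutMVar)
import Control.Monad (void)
import System.Posix.Signals (Handler (Catch), installHandler, sigINT, sigTERM)

main :: IO ()
main = do
  shutdownRequested <- newEmptyMVar
  let requestShutdown = void (tryPutMVar shutdownRequested ())
  _ <- installHandler sigTERM (Catch requestShutdown) Nothing
  _ <- installHandler sigINT (Catch requestShutdown) Nothing
  -- ... start the consumers here ...
  takeMVar shutdownRequested -- block until Kubernetes signals the pod
  runCleanup
  where
    runCleanup = putStrLn "draining consumers, closing the AMQP connection"

tryPutMVar keeps a second signal from blocking inside the handler if both SIGTERM and SIGINT arrive.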
I've included two scripts for manually testing these changes locally. I don't know how we would coordinate pushing messages to rabbit, running background-worker, sending signals, and checking if the service quit gracefully. If someone does have an idea on this I'd like to include that in this PR.
Checklist
changelog.d