
Conversation

Contributor

@lepsa lepsa commented Jul 12, 2023

Adding graceful shutdown handling via POSIX signals and exception handling.

Kubernetes signals and timings come from "Kubernetes best practices: terminating with grace".

I've included two scripts for manually testing these changes locally. I don't know how we would coordinate pushing messages to RabbitMQ, running background-worker, sending signals, and checking that the service quit gracefully. If someone has an idea on this, I'd like to include it in this PR.

Checklist

  • [✔] Add a new entry in an appropriate subdirectory of changelog.d
  • [✔] Read and follow the PR guidelines

lepsa added 6 commits July 12, 2023 17:41
Reworking how and where cleanup code is called so that we aren't
triggering a different set of exceptions about trying to send data over
closed network pipes.
Reworked where exceptions are caught and added per-consumer MVars to
signal when they are processing messages. This allows us to wait for
them to finish processing their current message before we close the
RabbitMQ connection.

Added scripts for filling RabbitMQ with messages and for simulating Kubernetes pod shutdown signalling.
@akshaymankar akshaymankar added the ok-to-test Approved for running tests in CI, overrides not-ok-to-test if both labels exist label Jul 12, 2023
Member

@akshaymankar akshaymankar left a comment


  1. I don't understand why we have one MVar both to tell the thread to stop and to mean that the thread has stopped. I might be interpreting this wrong, so please explain. I'm not sure why we don't use two MVars: one to signal to the consumer thread to stop retrying, and a second as a signal from the thread that it has stopped. There can only be one thread running per domain at any given time anyway.

  2. The MVar is just called mvar everywhere. Can it please have a better variable name?

@lepsa
Contributor Author

lepsa commented Jul 13, 2023

  1. I don't understand why we have one MVar both to tell the thread to stop and to mean that the thread has stopped. I might be interpreting this wrong, so please explain. I'm not sure why we don't use two MVars: one to signal to the consumer thread to stop retrying, and a second as a signal from the thread that it has stopped. There can only be only one thread running per domain at any given time anyway.

  2. The MVar is just called mvar everywhere. Can it please have a better variable name?

The single MVar is primarily used to signal to the cleanup code when the thread has finished processing a message from Rabbit, so that we don't close AMQP connections and channels before the threads have finished with them. That the MVar can also block the thread from sending out another notification isn't the goal, but it is a nice side benefit.
The primary way consumers are stopped from ingesting messages is that we remove them from the Rabbit channel. This tells amqp to deregister the consumer, but since we have threads and laziness to deal with, being proactive about checking that resources aren't going to be used again seems prudent.
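For reference, a minimal sketch of that pattern, assuming one MVar () per consumer that starts full; the names (consumeWithFlag, cleanupConsumer, runningFlag) are illustrative rather than the exact code in this PR:

import Control.Concurrent.MVar (MVar, putMVar, takeMVar, withMVar)
import Control.Exception (bracket_)
import qualified Network.AMQP as Q

-- Each consumer holds the flag while it is processing a message ...
consumeWithFlag :: MVar () -> (Q.Message -> IO ()) -> Q.Message -> IO ()
consumeWithFlag runningFlag handler msg =
  bracket_ (takeMVar runningFlag) (putMVar runningFlag ()) (handler msg)

-- ... and cleanup deregisters the consumer first, then waits for the flag
-- before tearing down the channel and connection.
cleanupConsumer :: MVar () -> Q.Connection -> Q.Channel -> Q.ConsumerTag -> IO ()
cleanupConsumer runningFlag conn chan tag = do
  Q.cancelConsumer chan tag              -- no new deliveries for this consumer
  withMVar runningFlag (\_ -> pure ())   -- blocks until the in-flight message is done
  Q.closeChannel chan
  Q.closeConnection conn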

Adding a config for the shutdown grace period so that the pod can limit how long it is going to allow requests to retry after it has received a shutdown signal from Kubernetes.
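A minimal sketch of how a grace-time option can bound the retry policy, in line with the policy visible in the Ormolu output further down; ShutdownGraceTime and boundedRetryPolicy are illustrative names, while the 2/3 "usable percentage" and the jitter base come from the diff:

{-# LANGUAGE NumericUnderscores #-}

import Control.Retry (RetryPolicyM, fullJitterBackoff, limitRetriesByCumulativeDelay)
import Numeric.Natural (Natural)

-- Grace period in seconds from the worker's config; Kubernetes'
-- terminationGracePeriodSeconds should be at least this long.
newtype ShutdownGraceTime = ShutdownGraceTime Natural

-- Spend only part of the grace period on retries, leaving time to NACK and
-- close connections before the pod is killed.
boundedRetryPolicy :: ShutdownGraceTime -> RetryPolicyM IO
boundedRetryPolicy (ShutdownGraceTime seconds) =
  limitRetriesByCumulativeDelay
    (floor ((2 / 3 :: Float) * fromIntegral seconds * 1_000_000))
    (fullJitterBackoff 10_000)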
@lepsa lepsa requested a review from akshaymankar July 14, 2023 06:14
@akshaymankar
Member

Ormolu is not happy:

integration/test/Testlib/Types.hs...  ok
libs/brig-types/src/Brig/Types/User.hs...  ok
libs/gundeck-types/src/Gundeck/Types/Push/V2.hs...  ok
libs/types-common-aws/src/Util/Test/SQS.hs...  ok
libs/wire-api/test/unit/Test/Wire/API/User/Search.hs...  ok
services/background-worker/exec/Main.hs...  ok
services/background-worker/src/Wire/BackendNotificationPusher.hs
@@ -57,7 +57,7 @@
    --
    -- If we fail to deliver the notification after policy, the notification will be NACKed,
    -- and will be redelivered by RabbitMQ for another attempt, most likely by the same pod.
-   let delayUsablePercentage = 2/3 :: Float
+   let delayUsablePercentage = 2 / 3 :: Float
        policy = limitRetriesByCumulativeDelay (floor $ delayUsablePercentage * fromIntegral env.shutdownGraceTime * 1_000_000) $ fullJitterBackoff 10000
        logErrr willRetry (SomeException e) rs = do
          Log.err $
services/background-worker/src/Wire/BackendNotificationPusher.hs...  *** FAILED
services/background-worker/src/Wire/BackgroundWorker.hs...  ok
services/background-worker/src/Wire/BackgroundWorker/Env.hs
@@ -12,6 +12,7 @@
  import Imports
  import Network.AMQP.Extended
  import qualified Network.RabbitMqAdmin as RabbitMqAdmin
+ import Numeric.Natural
  import OpenSSL.Session (SSLOption (..))
  import qualified OpenSSL.Session as SSL
  import Prometheus
@@ -21,6 +22,5 @@
  import qualified System.Logger.Extended as Log
  import Util.Options
  import Wire.BackgroundWorker.Options
- import Numeric.Natural
  type IsWorking = Bool
services/background-worker/src/Wire/BackgroundWorker/Env.hs...  *** FAILED
services/background-worker/src/Wire/BackgroundWorker/Health.hs...  ok
services/background-worker/src/Wire/BackgroundWorker/Options.hs
@@ -3,9 +3,9 @@
  import Data.Aeson
  import Imports
  import Network.AMQP.Extended
+ import Numeric.Natural
  import System.Logger.Extended
  import Util.Options
- import Numeric.Natural
  data Opts = Opts
    { logLevel :: !Level,
services/background-worker/src/Wire/BackgroundWorker/Options.hs...  *** FAILED
services/proxy/src/Proxy/Run.hs...  ok
tools/stern/test/unit/Main.hs...  ok
ormolu failed on 3 files.
you can fix this by running 'make format' from the git repo root.
make: *** [Makefile:216: formatc] Error 1

lepsa added 3 commits July 17, 2023 19:25
Changing how AMQP cancellation is performed. This is because the connection recovery code uses a new thread when recovering connections. These threads weren't tracked by the `async` call that was being used to signal cancellation. Now we're using MVars that are passed into callbacks to track channels and consumers, so that we can track what needs to be cleaned up even as threads die and are created.
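Roughly, that tracking idea looks like the sketch below; the record and function names are made up for illustration, not the PR's actual code:

import Control.Concurrent.MVar (MVar, modifyMVar_, readMVar)
import qualified Network.AMQP as Q

-- Whatever channel/consumer is currently live, regardless of which recovery
-- thread created it.
data Tracked = Tracked
  { trackedChan :: Q.Channel,
    trackedTag :: Q.ConsumerTag
  }

-- Passed into the "channel (re)created" callback so every recovery overwrites
-- the previous entry.
trackConsumer :: MVar (Maybe Tracked) -> Q.Channel -> Q.ConsumerTag -> IO ()
trackConsumer tracked chan tag =
  modifyMVar_ tracked (\_ -> pure (Just (Tracked chan tag)))

-- Called from the shutdown handler: cancel and close whatever is live right now.
shutdownTracked :: MVar (Maybe Tracked) -> IO ()
shutdownTracked tracked = do
  current <- readMVar tracked
  case current of
    Nothing -> pure ()
    Just (Tracked chan tag) -> do
      Q.cancelConsumer chan tag
      Q.closeChannel chan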
Member

@akshaymankar akshaymankar left a comment


Looks good overall, minor nits.

lepsa added 4 commits July 18, 2023 18:54
Updating the defederation runner to use IORefs and MVars in the same way as the notification pusher thread, where the channel and consumers are tracked and closed at SIGINT/SIGTERM.
Allowing the domain sync loop to detect when it is being cancelled by an async `cancel` call and to propagate the exception up the stack, breaking the `forever` loop and allowing the rest of the code to carry on.
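A small sketch of the "detect the cancel and propagate it" part described above, assuming the loop previously caught SomeException unconditionally; syncLoop is an illustrative name:

{-# LANGUAGE ScopedTypeVariables #-}

import Control.Concurrent.Async (AsyncCancelled (..))
import Control.Exception (SomeException, catch, fromException, throwIO)
import Control.Monad (forever)

-- Run one sync step forever, but let cancellation escape the catch-all handler
-- so the loop actually ends.
syncLoop :: IO () -> IO ()
syncLoop step =
  forever $
    step `catch` \(e :: SomeException) ->
      case fromException e of
        Just AsyncCancelled -> throwIO AsyncCancelled -- propagate up the stack
        Nothing -> pure ()                            -- log/ignore and keep looping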
bracket_ (takeMVar runningFlag) (putMVar runningFlag ()) $ do
  -- Non 2xx responses will throw an exception
  -- So we are relying on that to be caught by recovering
  resp <- liftIO $ httpLbs (req env d) manager
Member


This seems like a perfect opportunity to use servant; why are we building requests by hand? Perhaps this is also for another PR.

Contributor Author


I think that a separate PR would be best, as the other internal routes in galley should also be moved over to servant.
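For illustration only, a servant client for that kind of call could look roughly like this; the route, path segments, and types below are invented for the example and are not the actual galley internal API:

{-# LANGUAGE DataKinds #-}
{-# LANGUAGE TypeApplications #-}
{-# LANGUAGE TypeOperators #-}

import Data.Proxy (Proxy (..))
import Data.Text (Text)
import Servant.API (Capture, Delete, JSON, NoContent, (:>))
import Servant.Client (ClientM, client)

-- A made-up internal route, just to show the shape of a servant client.
type ExampleInternalAPI =
  "i" :> "federation" :> Capture "domain" Text :> Delete '[JSON] NoContent

deleteFederationDomain :: Text -> ClientM NoContent
deleteFederationDomain = client (Proxy @ExampleInternalAPI)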
