
Conversation

Contributor

@lepsa lepsa commented Jul 12, 2023

Adding graceful shutdown handling via POSIX signals and exception handling.

Kubernetes signals and timings come from "Kubernetes best practices: terminating with grace".

I've included two scripts for manually testing these changes locally. I don't know how we would coordinate pushing messages to RabbitMQ, running background-worker, sending signals, and checking that the service quit gracefully. If someone has an idea on this, I'd like to include it in this PR.

Checklist

  • [✔] Add a new entry in an appropriate subdirectory of changelog.d
  • [✔] Read and follow the PR guidelines

lepsa added 6 commits July 12, 2023 17:41
Reworking how and where cleanup code is called so that we aren't
triggering a different set of exceptions about trying to send data over
closed network pipes.
Reworked where exceptions are caught and added per-consumer MVars to
signal when they are processing messages. This allows us to wait for
them to finish processing their current message before we close the
RabbitMQ connection.

Added scripts for filling RabbitMQ with messages and for simulating Kubernetes pod shutdown signalling.
@akshaymankar akshaymankar added the ok-to-test Approved for running tests in CI, overrides not-ok-to-test if both labels exist label Jul 12, 2023
Member

@akshaymankar akshaymankar left a comment


  1. I don't understand why we have one MVar both to tell the thread to stop and to mean that the thread has stopped. I might be interpreting this wrong, so please explain. I'm not sure why we don't use two MVars: one to signal to the consumer thread to stop retrying, and a second as a signal from the thread that it has stopped. There can only be one thread running per domain at any given time anyway.

  2. The MVar is just called mvar everywhere. Can it please have a better variable name?

@lepsa
Contributor Author

lepsa commented Jul 13, 2023

  1. I don't understand why we have one MVar both to tell the thread to stop and to mean that the thread has stopped. I might be interpreting this wrong, so please explain. I'm not sure why we don't use two MVars: one to signal to the consumer thread to stop retrying, and a second as a signal from the thread that it has stopped. There can only be only one thread running per domain at any given time anyway.

  2. The MVar is just called mvar everywhere. Can it please have a better variable name?

The single MVar is primarily used to signal to the cleanup code when the thread has finished processing a message from Rabbit, so that we don't close AMQP connections and channels before the threads have finished with them. That the MVar can also block the thread from sending out another notification isn't the goal, but it is a nice side benefit.
The primary way consumers are stopped from ingesting messages is that we remove them from the Rabbit channel. This tells amqp to deregister the consumer, but since we have threads and laziness to deal with, being proactive about checking that resources aren't going to be used again seems prudent.
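For reference, a minimal sketch of that pattern, assuming one MVar () per consumer that starts full; the names (consumeWithFlag, cleanupConsumer, runningFlag) are illustrative rather than the exact code in this PR:

import Control.Concurrent.MVar (MVar, putMVar, takeMVar, withMVar)
import Control.Exception (bracket_)
import qualified Network.AMQP as Q

-- Each consumer holds the flag while it is processing a message ...
consumeWithFlag :: MVar () -> (Q.Message -> IO ()) -> Q.Message -> IO ()
consumeWithFlag runningFlag handler msg =
  bracket_ (takeMVar runningFlag) (putMVar runningFlag ()) (handler msg)

-- ... and cleanup deregisters the consumer first, then waits for the flag
-- before tearing down the channel and connection.
cleanupConsumer :: MVar () -> Q.Connection -> Q.Channel -> Q.ConsumerTag -> IO ()
cleanupConsumer runningFlag conn chan tag = do
  Q.cancelConsumer chan tag              -- no new deliveries for this consumer
  withMVar runningFlag (\_ -> pure ())   -- blocks until the in-flight message is done
  Q.closeChannel chan
  Q.closeConnection conn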

Adding a config for the shutdown grace period so that the pod can limit how long it is going to allow requests to retry after it has received a shutdown signal from Kubernetes.
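A minimal sketch of how a grace-time option can bound the retry policy, in line with the policy visible in the Ormolu output further down; ShutdownGraceTime and boundedRetryPolicy are illustrative names, while the 2/3 "usable percentage" and the jitter base come from the diff:

{-# LANGUAGE NumericUnderscores #-}

import Control.Retry (RetryPolicyM, fullJitterBackoff, limitRetriesByCumulativeDelay)
import Numeric.Natural (Natural)

-- Grace period in seconds from the worker's config; Kubernetes'
-- terminationGracePeriodSeconds should be at least this long.
newtype ShutdownGraceTime = ShutdownGraceTime Natural

-- Spend only part of the grace period on retries, leaving time to NACK and
-- close connections before the pod is killed.
boundedRetryPolicy :: ShutdownGraceTime -> RetryPolicyM IO
boundedRetryPolicy (ShutdownGraceTime seconds) =
  limitRetriesByCumulativeDelay
    (floor ((2 / 3 :: Float) * fromIntegral seconds * 1_000_000))
    (fullJitterBackoff 10_000)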
@lepsa lepsa requested a review from akshaymankar July 14, 2023 06:14
@akshaymankar
Member

Ormolu is not happy:

integration/test/Testlib/Types.hs...  ok
libs/brig-types/src/Brig/Types/User.hs...  ok
libs/gundeck-types/src/Gundeck/Types/Push/V2.hs...  ok
libs/types-common-aws/src/Util/Test/SQS.hs...  ok
libs/wire-api/test/unit/Test/Wire/API/User/Search.hs...  ok
services/background-worker/exec/Main.hs...  ok
services/background-worker/src/Wire/BackendNotificationPusher.hs
@@ -57,7 +57,7 @@
    --
    -- If we fail to deliver the notification after policy, the notification will be NACKed,
    -- and will be redelivered by RabbitMQ for another attempt, most likely by the same pod.
-   let delayUsablePercentage = 2/3 :: Float
+   let delayUsablePercentage = 2 / 3 :: Float
        policy = limitRetriesByCumulativeDelay (floor $ delayUsablePercentage * fromIntegral env.shutdownGraceTime * 1_000_000) $ fullJitterBackoff 10000
        logErrr willRetry (SomeException e) rs = do
          Log.err $
services/background-worker/src/Wire/BackendNotificationPusher.hs...  *** FAILED
services/background-worker/src/Wire/BackgroundWorker.hs...  ok
services/background-worker/src/Wire/BackgroundWorker/Env.hs
@@ -12,6 +12,7 @@
  import Imports
  import Network.AMQP.Extended
  import qualified Network.RabbitMqAdmin as RabbitMqAdmin
+ import Numeric.Natural
  import OpenSSL.Session (SSLOption (..))
  import qualified OpenSSL.Session as SSL
  import Prometheus
@@ -21,6 +22,5 @@
  import qualified System.Logger.Extended as Log
  import Util.Options
  import Wire.BackgroundWorker.Options
- import Numeric.Natural
  type IsWorking = Bool
services/background-worker/src/Wire/BackgroundWorker/Env.hs...  *** FAILED
services/background-worker/src/Wire/BackgroundWorker/Health.hs...  ok
services/background-worker/src/Wire/BackgroundWorker/Options.hs
@@ -3,9 +3,9 @@
  import Data.Aeson
  import Imports
  import Network.AMQP.Extended
+ import Numeric.Natural
  import System.Logger.Extended
  import Util.Options
- import Numeric.Natural
  data Opts = Opts
    { logLevel :: !Level,
services/background-worker/src/Wire/BackgroundWorker/Options.hs...  *** FAILED
services/proxy/src/Proxy/Run.hs...  ok
tools/stern/test/unit/Main.hs...  ok
ormolu failed on 3 files.
you can fix this by running 'make format' from the git repo root.
make: *** [Makefile:216: formatc] Error 1

lepsa added 3 commits July 17, 2023 19:25
Changing how AMQP cancellation is performed. This is because the connection recovery code uses a new thread when recovering connections. These threads weren't tracked by the `async` call that was being used to signal cancellation. Now we're using MVars that are passed into callbacks to track channels and consumers, so that we can track what needs to be cleaned up even as threads die and are created.
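Roughly, that tracking idea looks like the sketch below; the record and function names are made up for illustration, not the PR's actual code:

import Control.Concurrent.MVar (MVar, modifyMVar_, readMVar)
import qualified Network.AMQP as Q

-- Whatever channel/consumer is currently live, regardless of which recovery
-- thread created it.
data Tracked = Tracked
  { trackedChan :: Q.Channel,
    trackedTag :: Q.ConsumerTag
  }

-- Passed into the "channel (re)created" callback so every recovery overwrites
-- the previous entry.
trackConsumer :: MVar (Maybe Tracked) -> Q.Channel -> Q.ConsumerTag -> IO ()
trackConsumer tracked chan tag =
  modifyMVar_ tracked (\_ -> pure (Just (Tracked chan tag)))

-- Called from the shutdown handler: cancel and close whatever is live right now.
shutdownTracked :: MVar (Maybe Tracked) -> IO ()
shutdownTracked tracked = do
  current <- readMVar tracked
  case current of
    Nothing -> pure ()
    Just (Tracked chan tag) -> do
      Q.cancelConsumer chan tag
      Q.closeChannel chan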
Member

@akshaymankar akshaymankar left a comment


Looks good overall, minor nits.

lepsa added 4 commits July 18, 2023 18:54
Updating the defederation runner to use IORefs and MVars in the same way as the notification pusher thread, where the channel and consumers are tracked and closed at SIGINT/SIGTERM.
Allowing the domain sync loop to detect when it is being cancelled by an async `cancel` call and to propagate the exception up the stack, breaking the `forever` loop and allowing the rest of the code to carry on.
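A small sketch of the "detect the cancel and propagate it" part described above, assuming the loop previously caught SomeException unconditionally; syncLoop is an illustrative name:

{-# LANGUAGE ScopedTypeVariables #-}

import Control.Concurrent.Async (AsyncCancelled (..))
import Control.Exception (SomeException, catch, fromException, throwIO)
import Control.Monad (forever)

-- Run one sync step forever, but let cancellation escape the catch-all handler
-- so the loop actually ends.
syncLoop :: IO () -> IO ()
syncLoop step =
  forever $
    step `catch` \(e :: SomeException) ->
      case fromException e of
        Just AsyncCancelled -> throwIO AsyncCancelled -- propagate up the stack
        Nothing -> pure ()                            -- log/ignore and keep looping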
bracket_ (takeMVar runningFlag) (putMVar runningFlag ()) $ do
  -- Non 2xx responses will throw an exception
  -- So we are relying on that to be caught by recovering
  resp <- liftIO $ httpLbs (req env d) manager
Member


This seems like a perfect opportunity to use servant; why are we building requests by hand? Perhaps this is also for another PR.

Contributor Author


I think that a separate PR would be best, as the other internal routes in galley should also be moved over to servant.
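For illustration only, a servant client for that kind of call could look roughly like this; the route, path segments, and types below are invented for the example and are not the actual galley internal API:

{-# LANGUAGE DataKinds #-}
{-# LANGUAGE TypeApplications #-}
{-# LANGUAGE TypeOperators #-}

import Data.Proxy (Proxy (..))
import Data.Text (Text)
import Servant.API (Capture, Delete, JSON, NoContent, (:>))
import Servant.Client (ClientM, client)

-- A made-up internal route, just to show the shape of a servant client.
type ExampleInternalAPI =
  "i" :> "federation" :> Capture "domain" Text :> Delete '[JSON] NoContent

deleteFederationDomain :: Text -> ClientM NoContent
deleteFederationDomain = client (Proxy @ExampleInternalAPI)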
