
Conversation

Contributor

@sysvinit sysvinit commented Jun 7, 2022

This is the deployment side of wireapp/coturn#5. This change adds options to the coturn Helm chart to enable graceful restarts of coturn pods when used with a suitable service discovery mechanism. It relies on a pre-stop hook script, shipped in the (at time of writing) newest version of our coturn image, which monitors the number of active allocations on a coturn instance and blocks pod termination until that number reaches zero (subject to a grace period timeout).
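For reference, the new chart options boil down to roughly the following in values.yaml (a sketch based on the defaults discussed in the review below; the comments in the actual chart may differ):

    # Opt in to waiting for traffic to drain from coturn before pods are terminated.
    coturnGracefulTermination: false
    # Upper bound on how long a terminating pod may keep draining before it is
    # forcibly terminated; only effective when coturnGracefulTermination is true.
    coturnGracePeriodSeconds: 86400 # one day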

Checklist

  • The PR Title explains the impact of the change.
  • The PR description provides context as to why the change should occur and what the code contributes to that effect. This could also be a link to a JIRA ticket or a Github issue, if there is one.
  • If this PR changes development workflow or dependencies, they have been A) automated and B) documented under docs/developer/. All efforts have been taken to minimize development setup breakage or slowdown for co-workers.
  • If HTTP endpoint paths have been added or renamed, or feature configs have changed, the endpoint / config-flag checklist (see Wire-employee only backend wiki page) has been followed.
  • If a cassandra schema migration has been added, I ran make git-add-cassandra-schema to update the cassandra schema documentation.
  • changelog.d contains the following bits of information (details):
    • A file with the changelog entry in one or more suitable sub-sections. The sub-sections are marked by directories inside changelog.d.
    • If new config options introduced: added usage description under docs/reference/config-options.md
    • If new config options introduced: recommended measures to be taken by on-premise instance operators.
    • If a cassandra schema migration is backwards incompatible (see also these docs), measures to be taken by on-premise instance operators are explained.
    • If a data migration (not schema migration) introduced: measures to be taken by on-premise instance operators.
    • If public end-points have been changed or added: does nginz need an upgrade?
    • If internal end-points have been added or changed: which services have to be deployed in a specific order?

sysvinit added 3 commits June 13, 2022 11:31
This adds a preStop hook to coturn pods which waits until the number of
active allocations goes to zero before the coturn process is signalled to
terminate. This can be used with an external service discovery mechanism to
enable graceful rolling restarts, by removing each running coturn pod from the
set of active advertised coturn instances in turn.
Starting with this version, our coturn image ships with the pre-stop hook
script, which is required for graceful termination.
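When the option is enabled, the chart wires this script up as a preStop lifecycle handler on the coturn container, roughly as sketched below (the script path is illustrative, not necessarily the exact path shipped in the image):

    # Sketch of the relevant container fragment in the pod template.
    lifecycle:
      preStop:
        exec:
          # Blocks until the number of active allocations reaches zero; coturn
          # only receives SIGTERM once the hook exits, or the pod is killed
          # when the grace period expires.
          command: ["/usr/local/bin/coturn-pre-stop-hook"]  # illustrative path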
@sysvinit sysvinit force-pushed the sysvinit/coturn-graceful-termination branch from 9935711 to e212919 on June 13, 2022 10:37
@sysvinit sysvinit marked this pull request as ready for review June 13, 2022 10:44
Comment on lines +28 to +33
# This chart optionally supports waiting for traffic to drain from coturn
# before pods are terminated. Warning: coturn does not have any way to steer
# incoming client traffic away from itself on its own, so this functionality
# relies on external traffic management (e.g. service discovery for active coturn
# instances) to prevent clients from sending new requests to pods which are in a
# terminating state.
Contributor

During draining, shouldn't the endpoint slice of the service backing the statefulset mark the draining endpoints with Ready set to false?

So if that's used for service discovery, it should already work?
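For illustration, a draining endpoint in an EndpointSlice looks roughly like this (a sketch; the address and pod name are made up):

    addressType: IPv4
    endpoints:
      - addresses: ["10.0.0.12"]
        conditions:
          ready: false        # excluded from normal service discovery / load balancing
          serving: true       # still serving already-established traffic
          terminating: true   # the backing pod is shutting down
        targetRef:
          kind: Pod
          name: coturn-0      # illustrative pod name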

Contributor Author

So if that's used for service discovery, it should already work?

This should indeed just work in environments which have Kube-aware service discovery. @julialongtin tells me, however, that this generally is not the case in on-premise environments, and that the preference is to set things like the TURN endpoints statically. I'm hesitant to assume too much Kube prior knowledge in something we're shipping to customers, and I'd rather have it explicitly stated that while this graceful termination functionality exists, it needs extra moving parts in order to work correctly (which might not be feasible to implement in all environments).

Contributor

Yes, on-prem environments might just have the IP addresses/hostnames hardcoded, and don't actually take things out of the rotation.

So it'd probably be even more important for coturn to terminate more gracefully itself when receiving SIGTERM, by simply not accepting new connections after the signal arrives.

Contributor Author

Does restund do any sort of graceful termination on e.g. SIGTERM? My current understanding is that the failover strategy we have for restund is similar to what I'm trying to achieve here: we rely on fiddling with service discovery elsewhere to rotate instances out of the advertised instance group, and then watch the metrics until traffic has drained from the instance we're removing before we kill it off.

In the first instance I'm trying to implement automation for coturn which is at least approximately as bad as what we already have for restund, and which can then be the basis for incremental future development. As I mentioned in the other review thread, this chart is currently marked as alpha-quality, so I'm not really interested in bikeshedding the perfect coturn-on-k8s management solution here. That being said, if you want to write some patches against coturn to introduce application-level graceful termination functionality then that would certainly not be unwelcome, and would address the points you've raised in the other review thread.

# relies on external traffic management (e.g. service discovery for active coturn
# instances) to prevent clients from sending new requests to pods which are in a
# terminating state.
coturnGracefulTermination: false
Contributor

I'd probably have this feature enabled by default, and make a disableCoturnGracefulTermination defaulting to false.

Things like nginx also gracefully terminate and don't just blindly kill all connections when receiving SIGTERM - the fact we're using a preStop script to drain is an implementation detail.

Contributor Author

I'd probably have this feature enabled by default, and make a disableCoturnGracefulTermination defaulting to false.

I'm not sure which of [enabled by default]/[disabled by default] is the least surprising option here. I would prefer to keep the default behaviour of the coturn chart close to that of the restund chart (which does not have any graceful termination functionality), given that the former should eventually replace the latter. Enabling this functionality increases the runtime complexity/number of moving parts, and I'm not sure whether that's a good default.

Contributor

[…] Enabling this functionality increases the runtime complexity/number of moving parts […]

… and this is mostly due to wireapp/coturn#5 now ungracefully terminating connections on SIGTERM, which is why the additional moving parts (preStop) are needed in the first place.

Before wireapp/coturn#5, terminating endpoints in the service resource were marked with ready=False, so assuming service discovery honors that, it would take these TURN servers out of the list - and we'd automatically have some "draining period" during which new connections wouldn't arrive. The only annoying part was that coturn would entirely ignore SIGTERM, so turning it off would take all of terminationGracePeriodSeconds, instead of just the time until the last client disconnected.

wireapp/coturn#5 only makes SIGTERM ungraceful - coturn exits "immediately" (during the next event loop run), entirely ignoring that there are still open connections - so we're now adding the preStop script to prevent it from shutting down in an ungraceful way.

This means that with wireapp/coturn#5 merged, but users not explicitly setting coturnGracefulTermination: true, we're suddenly terminating connections way less gracefully.

That's a regression IMHO.

We should either put some more graceful termination logic into coturn itself (and probably get this upstreamed), so we don't need to maintain complex preStop logic we're not very confident about downstream, or accept the fact that faster, but still graceful draining is more complicated.

Contributor Author

I think we're starting to talk at cross-purposes here.

… and this is mostly due to wireapp/coturn#5 now ungracefully terminating connections on SIGTERM, which is why the additional moving parts (preStop) are needed in the first place.

No. wireapp/coturn#6 introduces the SIGTERM handler. wireapp/coturn#5 is the change which added the preStop hook script.

The preStop hook was introduced before the signal handling change was made, with the goal of postponing signals being sent to coturn until connections had drained, under incorrect assumptions about coturn's signal handling behaviour when running as PID 1.

The SIGTERM handler was subsequently introduced to ensure that coturn would exit correctly when signaled by the control plane (under the assumption that either the preStop hook has exited, or the grace period timeout has expired). I consider the addition of the SIGTERM handler to be a bugfix, as it means that coturn's signal handling behaviour is consistent irrespective of whether it runs as PID 1 or not.

Before wireapp/coturn#5, terminating endpoints in the service resource were marked with ready=False, so assuming service discovery honors that, it would take these TURN servers out of the list - and we'd automatically have some "draining period" during which new connections wouldn't arrive. The only annoying part was that coturn would entirely ignore SIGTERM, so turning it off would take all of terminationGracePeriodSeconds, instead of just the time until the last client disconnected.

This is not correct if you are not using the modifications to the Helm chart in this PR. Before this PR, there is no means to configure the termination grace period or enable the preStop hook. Hence, the default grace period of 30 seconds applies (see here and scroll down a bit).

  • If you're using coturn without the SIGTERM handler fix with the coturn Helm chart currently in develop, then when the control plane requests the termination of a coturn pod, the coturn process will ignore the initial SIGTERM, and then be forcibly killed 30 seconds later when the control plane delivers SIGKILL.

  • If you're using coturn with the SIGTERM handler fix with the develop Helm chart, then when the control plane requests the termination of a coturn pod, the coturn process will exit upon receipt of the initial SIGTERM.

In both of these cases, coturn dies relatively promptly after the control plane requests pod termination, and in both cases this happens regardless of the client load on a given coturn pod.

Now, considering the changes in this PR -- this expects that the coturn image contains the preStop hook script, so when the config flag in the Helm chart is enabled it will actually work correctly (ref: e212919). Therefore we eliminate the case where the version of the Helm chart in this PR is using an old version of coturn which has neither the hook script nor the SIGTERM fix.

If the operator has coturnGracefulTermination set to false, then when a coturn pod is terminated, it exits immediately on SIGTERM, as in the second case I described above. If this option is set to true, then you hit the extra logic, where the coturn pod may continue running for up to coturnGracePeriodSeconds after termination is requested, during which time it is not marked as Ready in the pod status reported by the control plane.
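To make the two cases concrete, the rendered pod spec differs roughly as follows (a sketch; the hook script path is illustrative):

    # coturnGracefulTermination: false (the default): no preStop hook is added
    # and the Kubernetes default terminationGracePeriodSeconds of 30 applies,
    # so coturn exits as soon as it receives SIGTERM.
    #
    # coturnGracefulTermination: true: the coturn container gains the preStop
    # hook and the grace period is raised to coturnGracePeriodSeconds:
    terminationGracePeriodSeconds: 86400
    containers:
      - name: coturn
        lifecycle:
          preStop:
            exec:
              command: ["/usr/local/bin/coturn-pre-stop-hook"]  # illustrative path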

This means that with wireapp/coturn#5 merged, but users not explicitly setting coturnGracefulTermination: true, we're suddenly terminating connections way less gracefully.

I disagree. The changes introduced by this PR, in their default configuration, do not significantly alter the termination behaviour of the services defined in this chart in comparison to the version of the chart preceding this PR.


I would also offer a gentle reminder that the coturn chart is currently marked in the tree as alpha-quality, and on this basis I think that significant changes to the chart's behaviour at this stage are within scope if we want to make them.

Contributor

Yes, I might have confused the two PR numbers. It was a bit confusing that the preStop script came before the SIGTERM handler.

My stance is that handling SIGTERM in coturn shouldn't be ungraceful in the first place - it's confusingly unintuitive compared to how other applications handle it. They usually switch into a mode where they don't accept new connections and exit properly once the still-open connections are done (and usually whatever is supervising the process, be it Kubernetes, systemd, or an impatient operator, sends SIGKILL after a timeout).

So with that implemented, wireapp/coturn#5 and the preStop hook in this PR wouldn't be necessary at all.

It's probably best to carry the discussion on how SIGTERM should be handled upstream. We don't want to keep maintaining a coturn fork downstream anyways.

Contributor Author

Alright, I think I understand your point better now.

Should I interpret your commentary here as an outright rejection of this PR? Or do we merge this, fix coturn, and then revise the Helm chart here afterwards?

I do feel that we're getting into "perfect is the enemy of the good" territory here.

Contributor

I don't think we should merge this, and should undo wireapp/coturn#5 and wireapp/coturn#6 until SIGTERM is handled gracefully.

I do feel that we're getting into "perfect is the enemy of the good" territory here.

If SIGTERM is handled gracefully, we don't need any preStop hooks, and as long as it isn't (and coturn plainly ignores SIGTERM), it's probably fine to just wait out terminationGracePeriodSeconds even if no clients are connected.

This overall feels way less brittle, and a graceful SIGTERM handler is probably much more likely to get upstreamed as well.

Member

I don't know how complicated it is to push changes into coturn, but I suspect it is not a cakewalk. I think it is fine to write hacks to get things working and then push for changes upstream.
So, unless doing this poses some significant problems for Wire users or operators, IMHO we should create another issue for fixing the upstream and merge this.

Contributor Author

I don't think we should merge this, and should undo wireapp/coturn#5 and wireapp/coturn#6 until SIGTERM is handled gracefully.

Tracking ticket for this functionality is here: https://wearezeta.atlassian.net/browse/SQPIT-1013. If not now, we have a reminder to do this later.

Comment on lines +35 to +38
# Grace period for terminating coturn pods, after which they will be forcibly
# terminated. This setting is only effective when coturnGracefulTermination is
# set to true.
coturnGracePeriodSeconds: 86400 # one day
Member

Since this is a statefulset, by default it would upgrade pods one by one. So, this number will get multiplied by the number of instances. Maybe it is better to either make this a deployment or set some updateStrategy in the statefulset to make things go faster.

Contributor

We can't have more than one instance running at the same time, as they'd bind on the same ports in the host namespace.

We can probably set .spec.updateStrategy.rollingUpdate.maxUnavailable to more than the default (1) to allow faster updates - but on the other hand, the other TURN nodes need to be able to cope with the additional traffic.
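For reference, letting the StatefulSet take down more than one pod at a time during a rolling update would look roughly like this (a sketch; note that maxUnavailable for StatefulSets is, at the time of writing, an alpha field behind the MaxUnavailableStatefulSet feature gate):

    updateStrategy:
      type: RollingUpdate
      rollingUpdate:
        # Allow up to two coturn pods to be taken down at once during an update;
        # the remaining TURN nodes must be able to absorb the extra traffic.
        maxUnavailable: 2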
