Add graceful termination to coturn chart #2456
Conversation
This adds a preStop hook to coturn pods which waits until the number of active allocations goes to zero before the coturn process is signalled to terminate. This can be used with an external service discovery mechanism to enable graceful rolling restarts, by removing each running coturn pod from the set of active advertised coturn instances in turn.
Starting with this version, our coturn image ships with the pre-stop hook script, which is required for graceful termination.
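As a rough sketch of how this fits together in the chart, the preStop hook ends up wired into the pod template roughly as below. The script path and image value names are illustrative assumptions, not the literal template in this PR; only the two chart options are taken from this change.

# Hypothetical excerpt from the coturn StatefulSet pod template (sketch).
spec:
  {{- if .Values.coturnGracefulTermination }}
  # Upper bound on how long a terminating pod may wait for traffic to drain.
  terminationGracePeriodSeconds: {{ .Values.coturnGracePeriodSeconds }}
  {{- end }}
  containers:
    - name: coturn
      image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
      {{- if .Values.coturnGracefulTermination }}
      lifecycle:
        preStop:
          exec:
            # Blocks termination until the bundled pre-stop script observes
            # zero active allocations (or the grace period expires and the
            # control plane sends SIGKILL). Script path is an assumption.
            command: ["/usr/local/bin/coturn-pre-stop"]
      {{- end }}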
(branch updated from 9935711 to e212919)
# This chart optionally supports waiting for traffic to drain from coturn
# before pods are terminated. Warning: coturn does not have any way to steer
# incoming client traffic away from itself on its own, so this functionality
# relies on external traffic management (e.g. service discovery for active coturn
# instances) to prevent clients from sending new requests to pods which are in a
# terminating state.
During draining, shouldn't the endpoint slice of the service backing the statefulset have the draining endpoints with Ready being set to false?
So if that's used for service discovery, it should already work?
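(For reference, a terminating endpoint in an EndpointSlice is reported roughly like this - a sketch with made-up names and addresses:)

apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: coturn-abc12              # made-up name
  labels:
    kubernetes.io/service-name: coturn
addressType: IPv4
endpoints:
  - addresses: ["10.0.0.12"]      # made-up address
    conditions:
      ready: false        # no longer a valid target for new traffic
      serving: true       # may still be handling existing connections
      terminating: true   # the backing pod has a deletion timestamp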
So if that's used for service discovery, it should already work?
This should indeed just work in environments which have Kube-aware service discovery. @julialongtin tells me, however, that this generally is not the case in on-premise environments, and that the preference is to set things like the TURN endpoints statically. I'm hesitant to assume too much Kube prior knowledge in something we're shipping to customers, and I'd rather have it explicitly stated that while this graceful termination functionality exists, it needs extra moving parts in order to work correctly (which might not be feasible to implement in all environments).
Yes, on-prem environments might just have the IP addresses/hostnames hardcoded, and don't actually take things out of the rotation.
So it'd probably be even more important for coturn to terminate more gracefully on SIGTERM, by simply not accepting new connections once it has received one.
Does restund do any sort of graceful termination on e.g. SIGTERM? My current understanding is that the failover strategy we have for restund is similar to what I'm trying to achieve here, that being we rely on fiddling with service discovery elsewhere to rotate instances out of the advertised instance group, and then watch the metrics until traffic has drained from the instance we're removing before we kill it off.
In the first instance I'm trying to implement automation for coturn which is at least approximately as good (or as bad) as what we already have for restund, which can then be the basis for incremental future development. As I mentioned in the other review thread, this chart is currently marked as alpha-quality, so I'm not really interested in bikeshedding the perfect coturn-on-k8s management solution here. That being said, if you want to write some patches against coturn to introduce application-level graceful termination functionality then that would certainly not be unwelcome, and would address the points you've raised in the other review thread.
# relies on external traffic management (e.g. service discovery for active coturn
# instances) to prevent clients from sending new requests to pods which are in a
# terminating state.
coturnGracefulTermination: false
I'd probably have this feature enabled by default, and add a disableCoturnGracefulTermination option defaulting to false.
Things like nginx also terminate gracefully and don't just blindly kill all connections when receiving SIGTERM - the fact that we're using a preStop script to drain is an implementation detail.
I'd probably have this feature enabled by default, and make a disableCoturnGracefulTermination defaulting to false.
I'm not sure which of [enabled by default]/[disabled by default] is the least surprising option here. I would prefer to keep the default behaviour of the coturn chart close to that of the restund chart (which does not have any graceful termination functionality), given that the former should eventually replace the latter. Enabling this functionality increases the runtime complexity/number of moving parts, and I'm not sure whether that's a good default.
[…] Enabling this functionality increases the runtime complexity/number of moving parts […]
… and this is mostly due to wireapp/coturn#5 now ungracefully terminating connections on SIGTERM, which is the reason for the additional moving parts (preStop) in the first place.
Before wireapp/coturn#5, terminating endpoints in the service resource were marked with ready=False, so assuming service discovery honors that, it'd take these TURN servers out of the list - and we'd automatically have some "draining period" during which new connections wouldn't happen. The only annoying part was that coturn would entirely ignore SIGTERM, so turning it off would take all of terminationGracePeriodSeconds, instead of just the time until the last client disconnected.
wireapp/coturn#5 only makes SIGTERM ungraceful - coturn exits itself "immediately" (during the next event loop run), entirely ignoring that there are still open connections - so we're now adding the preStop script to prevent it from doing this in an ungraceful way.
This means that with wireapp/coturn#5 merged, but users not explicitly setting coturnGracefulTermination: true, we're suddenly terminating connections much less gracefully.
That's a regression IMHO.
We should either put some more graceful termination logic into coturn itself (and probably get this upstreamed), so we don't need to maintain complex preStop logic we're not very confident about downstream, or accept the fact that faster, but still graceful, draining is more complicated.
I think we're starting to talk at cross-purposes here.
… and this is mostly due to wireapp/coturn#5 now ungracefully terminating connections on SIGTERM, which is the reason for the additional moving parts (preStop) in the first place.
No. wireapp/coturn#6 introduces the SIGTERM handler. wireapp/coturn#5 is the change which added the preStop hook script.
The preStop hook was introduced before the signal handling change was made, with the goal of postponing signals being sent to coturn until connections had drained, under incorrect assumptions about coturn's signal handling behaviour when running as PID 1.
The SIGTERM handler was subsequently introduced to ensure that coturn would exit correctly when signaled by the control plane (under the assumption that either the preStop hook has exited, or the grace period timeout has expired). I consider the addition of the SIGTERM handler to be a bugfix, as it means that coturn's signal handling behaviour is consistent irrespective of whether it runs as PID 1 or not.
Before wireapp/coturn#5, terminating endpoints in the service resource were marked with ready=False, so assuming service discovery honors that, it'd take these TURN servers out of the list - and we'd automatically have some "draining period" during which new connections wouldn't happen. The only annoying part was that coturn would entirely ignore SIGTERM, so turning it off would take all of terminationGracePeriodSeconds, instead of just the time until the last client disconnected.
This is not correct if you are not using the modifications to the Helm chart in this PR. Before this PR, there is no means to configure the termination grace period or enable the preStop hook. Hence, the default grace period of 30 seconds applies (see here and scroll down a bit).
- If you're using coturn without the SIGTERM handler fix with the coturn Helm chart currently in develop, then when the control plane requests the termination of a coturn pod, the coturn process will ignore the initial SIGTERM, and then be forcibly killed 30 seconds later when the control plane delivers SIGKILL.
- If you're using coturn with the SIGTERM handler fix with the develop Helm chart, then when the control plane requests the termination of a coturn pod, the coturn process will exit upon receipt of the initial SIGTERM.
In both of these cases, coturn dies relatively promptly after the control plane requests pod termination, and in both cases this happens regardless of the client load on a given coturn pod.
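To make the pre-PR default concrete, the effective termination settings look roughly like this (a sketch, not the literal chart output; the pod name and image reference are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: coturn-0                       # illustrative name
spec:
  terminationGracePeriodSeconds: 30    # Kubernetes default when unset
  containers:
    - name: coturn
      image: example/coturn:latest     # placeholder image reference
      # No lifecycle.preStop hook: SIGTERM is delivered as soon as the pod
      # is deleted; without a handler, coturn is SIGKILLed 30 seconds later.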
Now, considering the changes in this PR -- this expects that the coturn image contains the preStop hook script, so when the config flag in the Helm chart is enabled it will actually work correctly (ref: e212919). Therefore we eliminate the case where the version of the Helm chart in this PR is using an old version of coturn which has neither the hook script nor the SIGTERM fix.
If the operator has coturnGracefulTermination set to false, then when a coturn pod is terminated, it exits immediately on SIGTERM, as with the second case I described above. If this option is set to true, then you hit the extra logic, where the coturn pod may continue running for up to coturnGracePeriodSeconds after termination is requested, during which time it is not marked as Ready in the pod status reported by the control plane.
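For an operator who does want the draining behaviour, the opt-in looks roughly like this in the chart values (the option names are the ones added by this PR; the one-hour value is arbitrary):

# values override enabling graceful draining (sketch)
coturnGracefulTermination: true
# Upper bound on how long a terminating pod may keep running while it
# waits for active allocations to drain.
coturnGracePeriodSeconds: 3600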
This means that with wireapp/coturn#5 merged, but users not explicitly setting coturnGracefulTermination: true, we're suddenly terminating connections much less gracefully.
I disagree. The changes introduced by this PR, in their default configuration, do not significantly alter the termination behaviour of the services defined in this chart in comparison to the version of the chart preceding this PR.
I would also offer a gentle reminder that the coturn chart is currently marked in the tree as alpha-quality, and on this basis I think that significant changes to the chart's behaviour at this stage are within scope if we want to make them.
Yes, I might have confused the two PR numbers. It was a bit confusing that the preStop script came before the SIGTERM handler.
My stance is that handling SIGTERM in coturn shouldn't be ungraceful in the first place - it's confusingly unintuitive compared to how other applications handle it. They usually switch into a mode where they don't accept new connections, and properly exit once the still-open connections are done (and usually whatever's supervising the process, be it Kubernetes, systemd, or an impatient operator, will send SIGKILL after a timeout anyway).
So with that implemented, wireapp/coturn#5 and the preStop hook in this PR here wouldn't be necessary at all.
It's probably best to carry the discussion on how SIGTERM should be handled upstream. We don't want to keep maintaining a coturn fork downstream anyways.
Alright, I think I understand your point better now.
Should I interpret your commentary here as an outright rejection of this PR? Or do we merge this, fix coturn, and then revise the Helm chart here afterwards?
I do feel that we're getting into "perfect is the enemy of the good" territory here.
I don't think we should merge this, and should undo wireapp/coturn#5 and wireapp/coturn#6 until SIGTERM is handled gracefully.
I do feel that we're getting into "perfect is the enemy of the good" territory here.
If SIGTERM is handled gracefully, we don't need any preStop hooks, and as long as it isn't (and coturn plainly ignores SIGTERM), it's probably fine to just wait terminationGracePeriodSeconds even if no clients are connected.
This overall feels way less brittle, and a graceful SIGTERM handler is probably much more likely to get upstreamed as well.
I don't know how complicated it is to push changes into coturn, but I suspect it is not a cakewalk. I think it is fine to write hacks to get things working and then push for changes upstream.
So, unless doing this poses some significant problems for Wire users or operators, IMHO we should create another issue for fixing the upstream and merge this.
I don't think we should merge this, and should undo wireapp/coturn#5 and wireapp/coturn#6 until SIGTERM is handled gracefully.
Tracking ticket for this functionality is here: https://wearezeta.atlassian.net/browse/SQPIT-1013. If not now, we have a reminder to do this later.
# Grace period for terminating coturn pods, after which they will be forcibly
# terminated. This setting is only effective when coturnGracefulTermination is
# set to true.
coturnGracePeriodSeconds: 86400 # one day
Since this is a statefulset, by default it would upgrade pods one by one. So, this number will get multiplied by the number of instances. Maybe it is better to either make this a deployment or set some updateStrategy in the statefulset to make things go faster.
We can't have more than one instance running at the same time on the same node, as they'd bind on the same ports in the host namespace.
We can probably set .spec.updateStrategy.rollingUpdate.maxUnavailable to more than the default (1) to allow faster updates - but on the other hand, the other TURN nodes need to be able to cope with the additional traffic.
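As a sketch, that would look like the following in the StatefulSet spec (note that maxUnavailable on StatefulSets is only honoured on Kubernetes versions that support it, behind the MaxUnavailableStatefulSet feature gate; the value of 2 is arbitrary):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: coturn
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      # Allow two pods to be taken down at once during a rolling update;
      # only honoured where the MaxUnavailableStatefulSet gate is enabled.
      maxUnavailable: 2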
This is the deployment side of wireapp/coturn#5. This change adds additional options to the coturn Helm chart to enable graceful restart of coturn pods when used with a suitable service discovery mechanism. It relies on a pre-stop hook script shipped in the (at time of writing) newest version of our coturn image to monitor when the number of active allocations on a coturn instance goes to zero, blocking pod termination until this condition is met (subject to a grace period timeout).
Checklist
- Run make git-add-cassandra-schema to update the cassandra schema documentation.
- Add an entry in changelog.d.