
Conversation

Contributor

@sysvinit sysvinit commented Jun 7, 2022

This is the deployment side of wireapp/coturn#5. This change adds options to the coturn Helm chart to enable graceful restarts of coturn pods when used with a suitable service discovery mechanism. It relies on a pre-stop hook script, shipped in the (at time of writing) newest version of our coturn image, which monitors the number of active allocations on a coturn instance and blocks pod termination until that number reaches zero (subject to a grace period timeout).
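For reference, the new chart options boil down to roughly the following in values.yaml (a sketch based on the defaults discussed in the review below; the comments in the actual chart may differ):

    # Opt in to waiting for traffic to drain from coturn before pods are terminated.
    coturnGracefulTermination: false
    # Upper bound on how long a terminating pod may keep draining before it is
    # forcibly terminated; only effective when coturnGracefulTermination is true.
    coturnGracePeriodSeconds: 86400 # one day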

Checklist

  • The PR Title explains the impact of the change.
  • The PR description provides context as to why the change should occur and what the code contributes to that effect. This could also be a link to a JIRA ticket or a Github issue, if there is one.
  • If this PR changes development workflow or dependencies, they have been A) automated and B) documented under docs/developer/. All efforts have been taken to minimize development setup breakage or slowdown for co-workers.
  • If HTTP endpoint paths have been added or renamed, or feature configs have changed, the endpoint / config-flag checklist (see Wire-employee only backend wiki page) has been followed.
  • If a cassandra schema migration has been added, I ran make git-add-cassandra-schema to update the cassandra schema documentation.
  • changelog.d contains the following bits of information (details):
    • A file with the changelog entry in one or more suitable sub-sections. The sub-sections are marked by directories inside changelog.d.
    • If new config options introduced: added usage description under docs/reference/config-options.md
    • If new config options introduced: recommended measures to be taken by on-premise instance operators.
    • If a cassandra schema migration is backwards incompatible (see also these docs), measures to be taken by on-premise instance operators are explained.
    • If a data migration (not schema migration) introduced: measures to be taken by on-premise instance operators.
    • If public end-points have been changed or added: does nginz need an upgrade?
    • If internal end-points have been added or changed: which services have to be deployed in a specific order?

sysvinit added 3 commits June 13, 2022 11:31
This adds a preStop hook to coturn pods which waits until the number of
active allocations goes to zero before the coturn process is signalled to
terminate. This can be used with an external service discovery mechanism to
enable graceful rolling restarts, by removing each running coturn pod from the
set of active advertised coturn instances in turn.
Starting with this version, our coturn image ships with the pre-stop hook
script, which is required for graceful termination.
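When the option is enabled, the chart wires this script up as a preStop lifecycle handler on the coturn container, roughly as sketched below (the script path is illustrative, not necessarily the exact path shipped in the image):

    # Sketch of the relevant container fragment in the pod template.
    lifecycle:
      preStop:
        exec:
          # Blocks until the number of active allocations reaches zero; coturn
          # only receives SIGTERM once the hook exits, or the pod is killed
          # when the grace period expires.
          command: ["/usr/local/bin/coturn-pre-stop-hook"]  # illustrative path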
@sysvinit sysvinit force-pushed the sysvinit/coturn-graceful-termination branch from 9935711 to e212919 on June 13, 2022 10:37
@sysvinit sysvinit marked this pull request as ready for review June 13, 2022 10:44
Comment on lines +28 to +33
# This chart optionally supports waiting for traffic to drain from coturn
# before pods are terminated. Warning: coturn does not have any way to steer
# incoming client traffic away from itself on its own, so this functionality
# relies on external traffic management (e.g. service discovery for active coturn
# instances) to prevent clients from sending new requests to pods which are in a
# terminating state.
Contributor

During draining, shouldn't the endpoint slice of the service backing the statefulset mark the draining endpoints with Ready set to false?

So if that's used for service discovery, it should already work?
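For illustration, a draining endpoint in an EndpointSlice looks roughly like this (a sketch; the address and pod name are made up):

    addressType: IPv4
    endpoints:
      - addresses: ["10.0.0.12"]
        conditions:
          ready: false        # excluded from normal service discovery / load balancing
          serving: true       # still serving already-established traffic
          terminating: true   # the backing pod is shutting down
        targetRef:
          kind: Pod
          name: coturn-0      # illustrative pod name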

Contributor Author

So if that's used for service discovery, it should already work?

This should indeed just work in environments which have Kube-aware service discovery. @julialongtin tells me, however, that this generally is not the case in on-premise environments, and that the preference is to set things like the TURN endpoints statically. I'm hesitant to assume too much Kube prior knowledge in something we're shipping to customers, and I'd rather have it explicitly stated that while this graceful termination functionality exists, it needs extra moving parts in order to work correctly (which might not be feasible to implement in all environments).

Contributor

Yes, on-prem environments might just have the IP addresses/hostnames hardcoded, and don't actually take things out of the rotation.

So it'd probably be even more important for coturn to terminate more gracefully itself when receiving SIGTERM, by simply not accepting new connections after the signal arrives.

Contributor Author

Does restund do any sort of graceful termination on e.g. SIGTERM? My current understanding is that the failover strategy we have for restund is similar to what I'm trying to achieve here: we rely on fiddling with service discovery elsewhere to rotate instances out of the advertised instance group, and then watch the metrics until traffic has drained from the instance we're removing before we kill it off.

In the first instance I'm trying to implement automation for coturn which is at least approximately as bad as what we already have for restund, and which can then be the basis for incremental future development. As I mentioned in the other review thread, this chart is currently marked as alpha-quality, so I'm not really interested in bikeshedding the perfect coturn-on-k8s management solution here. That being said, if you want to write some patches against coturn to introduce application-level graceful termination functionality then that would certainly not be unwelcome, and would address the points you've raised in the other review thread.

# relies on external traffic management (e.g. service discovery for active coturn
# instances) to prevent clients from sending new requests to pods which are in a
# terminating state.
coturnGracefulTermination: false
Contributor

I'd probably have this feature enabled by default, and make a disableCoturnGracefulTermination defaulting to false.

Things like nginx also gracefully terminate and don't just blindly kill all connections when receiving SIGTERM - the fact we're using a preStop script to drain is an implementation detail.

Contributor Author

I'd probably have this feature enabled by default, and make a disableCoturnGracefulTermination defaulting to false.

I'm not sure which of [enabled by default]/[disabled by default] is the least surprising option here. I would prefer to keep the default behaviour of the coturn chart close to that of the restund chart (which does not have any graceful termination functionality), given that the former should eventually replace the latter. Enabling this functionality increases the runtime complexity/number of moving parts, and I'm not sure whether that's a good default.

Contributor

[…] Enabling this functionality increases the runtime complexity/number of moving parts […]

… and this is mostly due to wireapp/coturn#5 now ungracefully terminating connections on SIGTERM, which is why the additional moving parts (preStop) are needed in the first place.

Before wireapp/coturn#5, terminating endpoints in the service resource were marked with ready=False, so assuming service discovery honors that, it would take these TURN servers out of the list - and we'd automatically have some "draining period" during which new connections wouldn't arrive. The only annoying part was that coturn would entirely ignore SIGTERM, so turning it off would take all of terminationGracePeriodSeconds, instead of just the time until the last client disconnected.

wireapp/coturn#5 only makes SIGTERM ungraceful - coturn exits "immediately" (during the next event loop run), entirely ignoring that there are still open connections - so we're now adding the preStop script to prevent it from shutting down in an ungraceful way.

This means that with wireapp/coturn#5 merged, but users not explicitly setting coturnGracefulTermination: true, we're suddenly terminating connections way less gracefully.

That's a regression IMHO.

We should either put some more graceful termination logic into coturn itself (and probably get this upstreamed), so we don't need to maintain complex preStop logic we're not very confident about downstream, or accept the fact that faster, but still graceful draining is more complicated.

Contributor Author

I think we're starting to talk at cross-purposes here.

… and this is mostly due to wireapp/coturn#5 now ungracefully terminating connections on SIGTERM, which is why the additional moving parts (preStop) are needed in the first place.

No. wireapp/coturn#6 introduces the SIGTERM handler. wireapp/coturn#5 is the change which added the preStop hook script.

The preStop hook was introduced before the signal handling change was made, with the goal of postponing signals being sent to coturn until connections had drained, under incorrect assumptions about coturn's signal handling behaviour when running as PID 1.

The SIGTERM handler was subsequently introduced to ensure that coturn would exit correctly when signaled by the control plane (under the assumption that either the preStop hook has exited, or the grace period timeout has expired). I consider the addition of the SIGTERM handler to be a bugfix, as it means that coturn's signal handling behaviour is consistent irrespective of whether it runs as PID 1 or not.

Before wireapp/coturn#5, terminating endpoints in the service resource were marked with ready=False, so assuming service discovery honors that, it would take these TURN servers out of the list - and we'd automatically have some "draining period" during which new connections wouldn't arrive. The only annoying part was that coturn would entirely ignore SIGTERM, so turning it off would take all of terminationGracePeriodSeconds, instead of just the time until the last client disconnected.

This is not correct if you are not using the modifications to the Helm chart in this PR. Before this PR, there is no means to configure the termination grace period or enable the preStop hook. Hence, the default grace period of 30 seconds applies (see here and scroll down a bit).

  • If you're using coturn without the SIGTERM handler fix with the coturn Helm chart currently in develop, then when the control plane requests the termination of a coturn pod, the coturn process will ignore the initial SIGTERM, and then be forcibly killed 30 seconds later when the control plane delivers SIGKILL.

  • If you're using coturn with the SIGTERM handler fix with the develop Helm chart, then when the control plane requests the termination of a coturn pod, the coturn process will exit upon receipt of the initial SIGTERM.

In both of these cases, coturn dies relatively promptly after the control plane requests pod termination, and in both cases this happens regardless of the client load on a given coturn pod.

Now, considering the changes in this PR -- this expects that the coturn image contains the preStop hook script, so when the config flag in the Helm chart is enabled it will actually work correctly (ref: e212919). Therefore we eliminate the case where the version of the Helm chart in this PR is using an old version of coturn which has neither the hook script nor the SIGTERM fix.

If the operator has coturnGracefulTermination set to false, then when a coturn pod is terminated, it exits immediately on SIGTERM, as in the second case I described above. If this option is set to true, then you hit the extra logic, where the coturn pod may continue running for up to coturnGracePeriodSeconds after termination is requested, during which time it is not marked as Ready in the pod status reported by the control plane.
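To make the two cases concrete, the rendered pod spec differs roughly as follows (a sketch; the hook script path is illustrative):

    # coturnGracefulTermination: false (the default): no preStop hook is added
    # and the Kubernetes default terminationGracePeriodSeconds of 30 applies,
    # so coturn exits as soon as it receives SIGTERM.
    #
    # coturnGracefulTermination: true: the coturn container gains the preStop
    # hook and the grace period is raised to coturnGracePeriodSeconds:
    terminationGracePeriodSeconds: 86400
    containers:
      - name: coturn
        lifecycle:
          preStop:
            exec:
              command: ["/usr/local/bin/coturn-pre-stop-hook"]  # illustrative path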

This means that with wireapp/coturn#5 merged, but users not explicitly setting coturnGracefulTermination: true, we're suddenly terminating connections way less gracefully.

I disagree. The changes introduced by this PR, in their default configuration, do not significantly alter the termination behaviour of the services defined in this chart in comparison to the version of the chart preceding this PR.


I would also offer a gentle reminder that the coturn chart is currently marked in the tree as alpha-quality, and on this basis I think that significant changes to the chart's behaviour at this stage are within scope if we want to make them.

Contributor

Yes, I might have confused the two PR numbers. It was a bit confusing that the preStop script came before the SIGTERM handler.

My stance is that handling SIGTERM in coturn shouldn't be ungraceful in the first place - it's confusingly unintuitive compared to how other applications handle it. They usually switch into a mode where they don't accept new connections and exit properly once the still-open connections are done (and usually whatever is supervising the process, be it Kubernetes, systemd, or an impatient operator, sends SIGKILL after a timeout).

So with that implemented, wireapp/coturn#5 and the preStop hook in this PR wouldn't be necessary at all.

It's probably best to carry the discussion on how SIGTERM should be handled upstream. We don't want to keep maintaining a coturn fork downstream anyways.

Contributor Author

Alright, I think I understand your point better now.

Should I interpret your commentary here as an outright rejection of this PR? Or do we merge this, fix coturn, and then revise the Helm chart here afterwards?

I do feel that we're getting into "perfect is the enemy of the good" territory here.

Contributor

I don't think we should merge this, and should undo wireapp/coturn#5 and wireapp/coturn#6 until SIGTERM is handled gracefully.

I do feel that we're getting into "perfect is the enemy of the good" territory here.

If SIGTERM is handled gracefully, we don't need any preStop hooks, and as long as it isn't (and coturn plainly ignores SIGTERM), it's probably fine to just wait out terminationGracePeriodSeconds even if no clients are connected.

This overall feels way less brittle, and a graceful SIGTERM handler is probably much more likely to get upstreamed as well.

Member

I don't know how complicated it is to push changes into coturn, but I suspect it is not a cakewalk. I think it is fine to write hacks to get things working and then push for changes upstream.
So, unless doing this poses some significant problems for Wire users or operators, IMHO we should create another issue for fixing the upstream and merge this.

Contributor Author

I don't think we should merge this, and should undo wireapp/coturn#5 and wireapp/coturn#6 until SIGTERM is handled gracefully.

Tracking ticket for this functionality is here: https://wearezeta.atlassian.net/browse/SQPIT-1013. If not now, we have a reminder to do this later.

Comment on lines +35 to +38
# Grace period for terminating coturn pods, after which they will be forcibly
# terminated. This setting is only effective when coturnGracefulTermination is
# set to true.
coturnGracePeriodSeconds: 86400 # one day
Member

Since this is a statefulset, by default it would upgrade pods one by one. So, this number will get multiplied by the number of instances. Maybe it is better to either make this a deployment or set some updateStrategy in the statefulset to make things go faster.

Contributor

We can't have more than one instance running at the same time, as they'd bind on the same ports in the host namespace.

We can probably set .spec.updateStrategy.rollingUpdate.maxUnavailable to more than the default (1) to allow faster updates - but on the other hand, the other TURN nodes need to be able to cope with the additional traffic.
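For reference, letting the StatefulSet take down more than one pod at a time during a rolling update would look roughly like this (a sketch; note that maxUnavailable for StatefulSets is, at the time of writing, an alpha field behind the MaxUnavailableStatefulSet feature gate):

    updateStrategy:
      type: RollingUpdate
      rollingUpdate:
        # Allow up to two coturn pods to be taken down at once during an update;
        # the remaining TURN nodes must be able to absorb the extra traffic.
        maxUnavailable: 2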
