
Conversation

timstoop

@timstoop timstoop commented Oct 5, 2025

K8SPXC-2206

CHANGE DESCRIPTION

Problem:

When a PXC node fails (e.g., during a rolling restart), HAProxy takes 20+ seconds to detect the failure with the default settings (`check inter 10000` with `fall 2`, i.e. 2 × 10 s = 20 s). Worse, existing client connections to the failed backend are NOT terminated, causing them to hang until TCP timeout (potentially minutes).

The only workaround is to provide a complete HAProxy configuration via haproxy.configuration, which duplicates operator logic, breaks on upgrades, and is difficult to maintain.

Cause:

HAProxy backend health check parameters (interval, fall, rise, on-marked-down shutdown-sessions) are hardcoded in the operator and not exposed through the CR API.

Solution:

Add a new healthCheck field to the HAProxy spec allowing granular control:

```yaml
haproxy:
  healthCheck:
    interval: 3000           # ms between checks (default: 10000)
    fall: 2                  # consecutive failures before DOWN (default: 2)
    rise: 1                  # consecutive successes before UP (default: 1)
    shutdownOnMarkDown: true # terminate existing connections (default: false)
```

This enables fast failover detection (≈6 s instead of 20 s with the example above) and active cleanup of client connections, without overriding the entire configuration.
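For context, these values map onto the server options the operator renders into each backend entry (the default option string appears in the script later in this PR); the server name and address below are illustrative:

```
# Default server options (as in haproxy_add_pxc_nodes.sh):
server cluster1-pxc-0 10.0.0.1:3306 resolvers kubernetes check inter 10000 rise 1 fall 2 weight 1

# With the healthCheck example above (node marked DOWN after ~2 x 3 s = 6 s):
server cluster1-pxc-0 10.0.0.1:3306 resolvers kubernetes check inter 3000 rise 1 fall 2 weight 1 on-marked-down shutdown-sessions
```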

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?
  • Are OpenShift compare files changed for E2E tests (compare/*-oc.yml)?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support oldest and newest supported PXC version?
  • Does the change support oldest and newest supported Kubernetes version?

Additional Notes:

  • Fully backwards compatible (optional field with safe defaults)
  • Comprehensive unit test coverage
  • CRDs regenerated with validation rules

Fixes #2206

@it-percona-cla

it-percona-cla commented Oct 5, 2025

CLA assistant check
All committers have signed the CLA.

@pull-request-size pull-request-size bot added the size/L 100-499 lines label Oct 5, 2025
@timstoop timstoop changed the title Add configurable HAProxy health check parameters K8SPXC-2206 Add configurable HAProxy health check parameters Oct 5, 2025
@timstoop timstoop force-pushed the K8SPXC-2206-haproxy-health-check-config branch from bc23e2f to cb1bef1 Compare October 6, 2025 06:50
@timstoop
Author

timstoop commented Oct 6, 2025

I'm btw happy to add this change to the Helm chart and the docs as well, but wanted to wait with those PRs until I knew for sure this would be accepted.

Add support for configuring HAProxy backend health check parameters
without requiring full configuration override via haproxy.configuration.

This allows users to:
- Customize health check interval (default: 10000ms)
- Configure fail threshold (default: 2 consecutive failures)
- Configure rise threshold (default: 1 consecutive success)
- Enable automatic connection termination on backend failure

Benefits:
- Faster failover detection (e.g., 6s with interval: 3000, fall: 2)
- Active connection cleanup when backends fail
- No need to duplicate operator's HAProxy configuration
- Survives operator upgrades

Implementation:
- Added HAProxyHealthCheckSpec to CR API
- Environment variables passed to haproxy_add_pxc_nodes.sh
- Backwards compatible with existing deployments
- Comprehensive test coverage

Example usage:
```yaml
haproxy:
  healthCheck:
    interval: 3000
    fall: 2
    rise: 1
    shutdownOnMarkDown: true
```

Fixes percona#2206
@timstoop timstoop force-pushed the K8SPXC-2206-haproxy-health-check-config branch from cb1bef1 to c4cc754 Compare October 6, 2025 09:08
@hors
Collaborator

hors commented Oct 6, 2025

@timstoop haproxy does not have on-marked-down shutdown-sessions at all now :( I am going to add it via https://github.com/percona/percona-xtradb-cluster-operator/pull/2205/files . Do you want to be able to enable/disable it? From my point of view, we can just add it by default, but maybe you have a different use case.

@timstoop
Author

timstoop commented Oct 6, 2025

I totally agree that it makes sense to have it as default, it's what we were planning on doing in our Helm default code anyway. I only made it togglable in case there were good reasons that I didn't see and I really wanted to have this change implemented :-) I totally missed the 2205 PR, no idea how that could happen.
Do you want me to remove it from this patch and only use this PR to allow setting of the health checks instead?

- Add -r flag to read command to prevent backslash mangling
- Add quotes around variable in echo command
- Replace x-prefix comparison with direct comparison
- Replace -a operator with && for better POSIX compliance
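Illustrative before/after of the shell cleanups listed above (the snippet and its variable names are hypothetical, not taken from haproxy_add_pxc_nodes.sh):

```shell
# Before (problematic patterns):
#   while read line; do echo $line; done
#   if [ "x$mode" = "xon" -a -n "$host" ]; then ...

# After:
while read -r line; do    # -r prevents backslash mangling
  echo "$line"            # quoting preserves the value verbatim
done <<'EOF'
one\ttwo
EOF

mode='on'
host='pxc-0'
if [ "$mode" = 'on' ] && [ -n "$host" ]; then  # direct comparison; && instead of -a
  echo "enabled for $host"                     # prints: enabled for pxc-0
fi
```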
Add HA_SERVER_OPTIONS environment variable to all HAProxy statefulset
compare files for e2e tests. This env var was added in the main feature
and needs to be present in the expected test outputs.

Tests affected:
- haproxy
- upgrade-haproxy
- proxy-protocol
- monitoring-2-0
- monitoring-pmm3
- default-cr
HA_SERVER_OPTIONS should come before REPLICAS_SVC_ONLY_READERS in the env list.
@hors
Collaborator

hors commented Oct 6, 2025

I totally agree that it makes sense to have it as default, it's what we were planning on doing in our Helm default code anyway. I only made it togglable in case there were good reasons that I didn't see and I really wanted to have this change implemented :-) I totally missed the 2205 PR, no idea how that could happen. Do you want me to remove it from this patch and only use this PR to allow setting of the health checks instead?

yes, please

firs_node_replica=''
main_node=''

SERVER_OPTIONS=${HA_SERVER_OPTIONS:-'resolvers kubernetes check inter 10000 rise 1 fall 2 weight 1'}
Collaborator

What if I want to set custom HA_SERVER_OPTIONS via a secret? Will it work with your changes?

Author

Right, that would break. Fixed in my latest commit!

As requested in PR review, this feature is being added separately in PR percona#2205.

Also fixed HA_SERVER_OPTIONS to allow custom values via secret:
- When healthCheck is not configured in the CR, HA_SERVER_OPTIONS env var
  is not set by the operator, allowing it to be customized via the
  haproxy-env-vars secret or use the script's default value
- When healthCheck is configured, the operator generates HA_SERVER_OPTIONS
  with the specified interval, rise, and fall values

Changes:
- Removed ShutdownOnMarkDown field from HAProxyHealthCheckSpec API
- Removed HA_SHUTDOWN_ON_MARK_DOWN env var and shutdown_on_mark_down logic
- Updated CRDs and generated code
- Updated tests
- Updated example CR documentation
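A hedged sketch of the secret-based override path described above (the secret name and key handling are illustrative; only `HA_SERVER_OPTIONS` and its default in the script come from this PR):

```yaml
# Illustrative only: supplying HA_SERVER_OPTIONS via the HAProxy env-vars
# secret, which takes effect when spec.haproxy.healthCheck is NOT set in the CR.
apiVersion: v1
kind: Secret
metadata:
  name: my-cluster-haproxy-env-vars   # name illustrative
type: Opaque
stringData:
  HA_SERVER_OPTIONS: resolvers kubernetes check inter 5000 rise 1 fall 3 weight 1
```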
@pull-request-size pull-request-size bot added size/XXL 1000+ lines and removed size/L 100-499 lines labels Oct 6, 2025
@timstoop
Author

timstoop commented Oct 6, 2025

Removed on marked down shutdown sessions option, as requested.

Also fixed the handling of HA_SERVER_OPTIONS via a secret.

timstoop and others added 2 commits October 7, 2025 00:13
The code generates env vars in this order:
1. PXC_SERVICE
2. HA_SERVER_OPTIONS (from buildHAProxyHealthCheckEnvVars)
3. IS_PROXY_PROTOCOL (added later if proxy protocol is configured)
4. REPLICAS_SVC_ONLY_READERS

Updated compare files to match the generated order.
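Sketched as a Kubernetes container env list (values are illustrative, only the names and their order come from the commit message above):

```yaml
env:
  - name: PXC_SERVICE
    value: cluster1-pxc             # value illustrative
  - name: HA_SERVER_OPTIONS
    value: resolvers kubernetes check inter 3000 rise 1 fall 2 weight 1
  - name: IS_PROXY_PROTOCOL         # present only when proxy protocol is configured
    value: "yes"
  - name: REPLICAS_SVC_ONLY_READERS
    value: "false"
```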
@JNKPercona
Collaborator

Test Name Result Time
affinity-8-0 failure 00:03:10
auto-tuning-8-0 passed 00:20:50
cross-site-8-0 failure 00:08:49
custom-users-8-0 failure 00:07:31
demand-backup-cloud-8-0 failure 00:07:12
demand-backup-encrypted-with-tls-8-0 failure 00:12:16
demand-backup-8-0 failure 00:11:59
demand-backup-flow-control-8-0 failure 00:36:59
demand-backup-parallel-8-0 failure 00:07:28
demand-backup-without-passwords-8-0 failure 00:35:33
haproxy-5-7 failure 00:45:11
haproxy-8-0 failure 00:44:43
init-deploy-5-7 failure 00:04:55
init-deploy-8-0 failure 00:04:51
limits-8-0 failure 00:07:52
monitoring-2-0-8-0 failure 00:14:57
monitoring-pmm3-8-0 failure 00:25:00
one-pod-5-7 failure 00:04:27
one-pod-8-0 failure 00:04:18
pitr-8-0 failure 00:03:06
pitr-gap-errors-8-0 failure 00:10:00
proxy-protocol-8-0 failure 00:08:43
proxysql-sidecar-res-limits-8-0 passed 00:09:34
pvc-resize-5-7 failure 00:39:23
pvc-resize-8-0 failure 00:39:50
recreate-8-0 passed 00:16:16
restore-to-encrypted-cluster-8-0 failure 00:06:37
scaling-proxysql-8-0 passed 00:11:47
scaling-8-0 passed 00:13:13
scheduled-backup-5-7 failure 00:15:37
scheduled-backup-8-0 failure 00:07:58
security-context-8-0 failure 00:10:18
smart-update1-8-0 failure 00:03:50
smart-update2-8-0 failure 00:14:50
storage-8-0 passed 00:15:09
tls-issue-cert-manager-ref-8-0 failure 00:04:58
tls-issue-cert-manager-8-0 failure 00:45:57
tls-issue-self-8-0 failure 00:36:44
upgrade-consistency-8-0 failure 00:39:32
upgrade-haproxy-5-7 failure 00:14:51
upgrade-haproxy-8-0 failure 00:14:32
upgrade-proxysql-5-7 passed 00:14:50
upgrade-proxysql-8-0 passed 00:28:54
users-5-7 failure 00:07:37
users-8-0 passed 00:25:01
validation-hook-8-0 passed 00:01:55
We ran 46 out of 46 (total 12:49:26)

commit: f8e092d
image: perconalab/percona-xtradb-cluster-operator:PR-2207-f8e092d7


Development

Successfully merging this pull request may close these issues.

Add configurable HAProxy backend health check parameters without requiring full configuration override
