
Conversation

timstoop

@timstoop timstoop commented Oct 5, 2025

K8SPXC-2206

CHANGE DESCRIPTION

Problem:

When a PXC node fails (e.g., during a rolling restart), HAProxy takes 20+ seconds to detect the failure with the default settings (`check inter 10000` with `fall 2`, i.e. 2 × 10 s = 20 s). Worse, existing client connections to the failed backend are NOT terminated, causing them to hang until TCP timeout (potentially minutes).

The only workaround is to provide a complete HAProxy configuration via haproxy.configuration, which duplicates operator logic, breaks on upgrades, and is difficult to maintain.

Cause:

HAProxy backend health check parameters (interval, fall, rise, on-marked-down shutdown-sessions) are hardcoded in the operator and not exposed through the CR API.

Solution:

Add a new healthCheck field to the HAProxy spec allowing granular control:

```yaml
haproxy:
  healthCheck:
    interval: 3000           # ms between checks (default: 10000)
    fall: 2                  # consecutive failures before DOWN (default: 2)
    rise: 1                  # consecutive successes before UP (default: 1)
    shutdownOnMarkDown: true # terminate existing connections (default: false)
```

This enables fast failover detection (≈6 s instead of 20 s with the example above) and active cleanup of client connections, without overriding the entire configuration.
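For context, these values map onto the server options the operator renders into each backend entry (the default option string appears in the script later in this PR); the server name and address below are illustrative:

```
# Default server options (as in haproxy_add_pxc_nodes.sh):
server cluster1-pxc-0 10.0.0.1:3306 resolvers kubernetes check inter 10000 rise 1 fall 2 weight 1

# With the healthCheck example above (node marked DOWN after ~2 x 3 s = 6 s):
server cluster1-pxc-0 10.0.0.1:3306 resolvers kubernetes check inter 3000 rise 1 fall 2 weight 1 on-marked-down shutdown-sessions
```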

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?
  • Are OpenShift compare files changed for E2E tests (compare/*-oc.yml)?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support oldest and newest supported PXC version?
  • Does the change support oldest and newest supported Kubernetes version?

Additional Notes:

  • Fully backwards compatible (optional field with safe defaults)
  • Comprehensive unit test coverage
  • CRDs regenerated with validation rules

Fixes #2206

@it-percona-cla

it-percona-cla commented Oct 5, 2025

CLA assistant check
All committers have signed the CLA.

@pull-request-size pull-request-size bot added the size/L 100-499 lines label Oct 5, 2025
@timstoop timstoop changed the title Add configurable HAProxy health check parameters K8SPXC-2206 Add configurable HAProxy health check parameters Oct 5, 2025
@timstoop timstoop force-pushed the K8SPXC-2206-haproxy-health-check-config branch from bc23e2f to cb1bef1 Compare October 6, 2025 06:50
@timstoop
Author

timstoop commented Oct 6, 2025

I'm btw happy to add this change to the Helm chart and the docs as well, but wanted to wait with those PRs until I knew for sure this would be accepted.

Add support for configuring HAProxy backend health check parameters
without requiring full configuration override via haproxy.configuration.

This allows users to:
- Customize health check interval (default: 10000ms)
- Configure fail threshold (default: 2 consecutive failures)
- Configure rise threshold (default: 1 consecutive success)
- Enable automatic connection termination on backend failure

Benefits:
- Faster failover detection (e.g., 6s with interval: 3000, fall: 2)
- Active connection cleanup when backends fail
- No need to duplicate operator's HAProxy configuration
- Survives operator upgrades

Implementation:
- Added HAProxyHealthCheckSpec to CR API
- Environment variables passed to haproxy_add_pxc_nodes.sh
- Backwards compatible with existing deployments
- Comprehensive test coverage

Example usage:
```yaml
haproxy:
  healthCheck:
    interval: 3000
    fall: 2
    rise: 1
    shutdownOnMarkDown: true
```

Fixes percona#2206
@timstoop timstoop force-pushed the K8SPXC-2206-haproxy-health-check-config branch from cb1bef1 to c4cc754 Compare October 6, 2025 09:08
@hors
Collaborator

hors commented Oct 6, 2025

@timstoop haproxy does not have on-marked-down shutdown-sessions at all now :( I am going to add it via https://github.com/percona/percona-xtradb-cluster-operator/pull/2205/files . Do you want to be able to enable/disable it? From my point of view, we can just add it by default, but maybe you have a different use case.

@timstoop
Author

timstoop commented Oct 6, 2025

I totally agree that it makes sense to have it as default, it's what we were planning on doing in our Helm default code anyway. I only made it togglable in case there were good reasons that I didn't see and I really wanted to have this change implemented :-) I totally missed the 2205 PR, no idea how that could happen.
Do you want me to remove it from this patch and only use this PR to allow setting of the health checks instead?

- Add -r flag to read command to prevent backslash mangling
- Add quotes around variable in echo command
- Replace x-prefix comparison with direct comparison
- Replace -a operator with && for better POSIX compliance
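Illustrative before/after of the shell cleanups listed above (the snippet and its variable names are hypothetical, not taken from haproxy_add_pxc_nodes.sh):

```shell
# Before (problematic patterns):
#   while read line; do echo $line; done
#   if [ "x$mode" = "xon" -a -n "$host" ]; then ...

# After:
while read -r line; do    # -r prevents backslash mangling
  echo "$line"            # quoting preserves the value verbatim
done <<'EOF'
one\ttwo
EOF

mode='on'
host='pxc-0'
if [ "$mode" = 'on' ] && [ -n "$host" ]; then  # direct comparison; && instead of -a
  echo "enabled for $host"                     # prints: enabled for pxc-0
fi
```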
Add HA_SERVER_OPTIONS environment variable to all HAProxy statefulset
compare files for e2e tests. This env var was added in the main feature
and needs to be present in the expected test outputs.

Tests affected:
- haproxy
- upgrade-haproxy
- proxy-protocol
- monitoring-2-0
- monitoring-pmm3
- default-cr
HA_SERVER_OPTIONS should come before REPLICAS_SVC_ONLY_READERS in the env list.
@hors
Collaborator

hors commented Oct 6, 2025

I totally agree that it makes sense to have it as default, it's what we were planning on doing in our Helm default code anyway. I only made it togglable in case there were good reasons that I didn't see and I really wanted to have this change implemented :-) I totally missed the 2205 PR, no idea how that could happen. Do you want me to remove it from this patch and only use this PR to allow setting of the health checks instead?

yes, please

firs_node_replica=''
main_node=''

SERVER_OPTIONS=${HA_SERVER_OPTIONS:-'resolvers kubernetes check inter 10000 rise 1 fall 2 weight 1'}
Collaborator

What if I want to set custom HA_SERVER_OPTIONS via a secret? Will it work with your changes?

Author

Right, that would break. Fixed in my latest commit!

As requested in PR review, this feature is being added separately in PR percona#2205.

Also fixed HA_SERVER_OPTIONS to allow custom values via secret:
- When healthCheck is not configured in the CR, HA_SERVER_OPTIONS env var
  is not set by the operator, allowing it to be customized via the
  haproxy-env-vars secret or use the script's default value
- When healthCheck is configured, the operator generates HA_SERVER_OPTIONS
  with the specified interval, rise, and fall values

Changes:
- Removed ShutdownOnMarkDown field from HAProxyHealthCheckSpec API
- Removed HA_SHUTDOWN_ON_MARK_DOWN env var and shutdown_on_mark_down logic
- Updated CRDs and generated code
- Updated tests
- Updated example CR documentation
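A hedged sketch of the secret-based override path described above (the secret name and key handling are illustrative; only `HA_SERVER_OPTIONS` and its default in the script come from this PR):

```yaml
# Illustrative only: supplying HA_SERVER_OPTIONS via the HAProxy env-vars
# secret, which takes effect when spec.haproxy.healthCheck is NOT set in the CR.
apiVersion: v1
kind: Secret
metadata:
  name: my-cluster-haproxy-env-vars   # name illustrative
type: Opaque
stringData:
  HA_SERVER_OPTIONS: resolvers kubernetes check inter 5000 rise 1 fall 3 weight 1
```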
@pull-request-size pull-request-size bot added size/XXL 1000+ lines and removed size/L 100-499 lines labels Oct 6, 2025
@timstoop
Author

timstoop commented Oct 6, 2025

Removed on marked down shutdown sessions option, as requested.

Also fixed the handling of HA_SERVER_OPTIONS via a secret.

timstoop and others added 2 commits October 7, 2025 00:13
The code generates env vars in this order:
1. PXC_SERVICE
2. HA_SERVER_OPTIONS (from buildHAProxyHealthCheckEnvVars)
3. IS_PROXY_PROTOCOL (added later if proxy protocol is configured)
4. REPLICAS_SVC_ONLY_READERS

Updated compare files to match the generated order.
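Sketched as a Kubernetes container env list (values are illustrative, only the names and their order come from the commit message above):

```yaml
env:
  - name: PXC_SERVICE
    value: cluster1-pxc             # value illustrative
  - name: HA_SERVER_OPTIONS
    value: resolvers kubernetes check inter 3000 rise 1 fall 2 weight 1
  - name: IS_PROXY_PROTOCOL         # present only when proxy protocol is configured
    value: "yes"
  - name: REPLICAS_SVC_ONLY_READERS
    value: "false"
```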
@JNKPercona
Collaborator

Test Name Result Time
affinity-8-0 failure 00:03:10
auto-tuning-8-0 passed 00:20:50
cross-site-8-0 failure 00:08:49
custom-users-8-0 failure 00:07:31
demand-backup-cloud-8-0 failure 00:07:12
demand-backup-encrypted-with-tls-8-0 failure 00:12:16
demand-backup-8-0 failure 00:11:59
demand-backup-flow-control-8-0 failure 00:36:59
demand-backup-parallel-8-0 failure 00:07:28
demand-backup-without-passwords-8-0 failure 00:35:33
haproxy-5-7 failure 00:45:11
haproxy-8-0 failure 00:44:43
init-deploy-5-7 failure 00:04:55
init-deploy-8-0 failure 00:04:51
limits-8-0 failure 00:07:52
monitoring-2-0-8-0 failure 00:14:57
monitoring-pmm3-8-0 failure 00:25:00
one-pod-5-7 failure 00:04:27
one-pod-8-0 failure 00:04:18
pitr-8-0 failure 00:03:06
pitr-gap-errors-8-0 failure 00:10:00
proxy-protocol-8-0 failure 00:08:43
proxysql-sidecar-res-limits-8-0 passed 00:09:34
pvc-resize-5-7 failure 00:39:23
pvc-resize-8-0 failure 00:39:50
recreate-8-0 passed 00:16:16
restore-to-encrypted-cluster-8-0 failure 00:06:37
scaling-proxysql-8-0 passed 00:11:47
scaling-8-0 passed 00:13:13
scheduled-backup-5-7 failure 00:15:37
scheduled-backup-8-0 failure 00:07:58
security-context-8-0 failure 00:10:18
smart-update1-8-0 failure 00:03:50
smart-update2-8-0 failure 00:14:50
storage-8-0 passed 00:15:09
tls-issue-cert-manager-ref-8-0 failure 00:04:58
tls-issue-cert-manager-8-0 failure 00:45:57
tls-issue-self-8-0 failure 00:36:44
upgrade-consistency-8-0 failure 00:39:32
upgrade-haproxy-5-7 failure 00:14:51
upgrade-haproxy-8-0 failure 00:14:32
upgrade-proxysql-5-7 passed 00:14:50
upgrade-proxysql-8-0 passed 00:28:54
users-5-7 failure 00:07:37
users-8-0 passed 00:25:01
validation-hook-8-0 passed 00:01:55
We ran 46 out of 46 (total 12:49:26)

commit: f8e092d
image: perconalab/percona-xtradb-cluster-operator:PR-2207-f8e092d7


Development

Successfully merging this pull request may close these issues.

Add configurable HAProxy backend health check parameters without requiring full configuration override
