Description
During a rolling upgrade of a multi-node Kafka cluster, we change the broker listener configuration in several steps and restart the brokers one by one. After the final listener configuration is applied and the brokers are restarted, Kafka clients using librdkafka experience connection failures until the client process is restarted.
Environment
- Kafka version: 3.9.0
- librdkafka version: below 2.10.0
Upgrade Steps
We apply the following configuration changes step by step, restarting the brokers after each change (a small metadata check that can be run after each step is sketched below the list):

1. listener.security.protocol.map=BROKER:SASL_PLAINTEXT,CONTROLLER:SASL_PLAINTEXT,SASL_PLAINTEXT:SASL_PLAINTEXT
   listeners=SASL_PLAINTEXT://<hostname>:9092,BROKER://<hostname>:9094
   inter.broker.listener.name=SASL_PLAINTEXT

2. listener.security.protocol.map=BROKER:SASL_PLAINTEXT,CONTROLLER:SASL_PLAINTEXT,SASL_PLAINTEXT:SASL_PLAINTEXT
   listeners=SASL_PLAINTEXT://<hostname>:9092,BROKER://<hostname>:9094
   inter.broker.listener.name=BROKER

3. listener.security.protocol.map=BROKER:SASL_PLAINTEXT,CONTROLLER:SASL_PLAINTEXT,SASL_PLAINTEXT:SASL_PLAINTEXT
   listeners=BROKER://<hostname>:9092,SASL_PLAINTEXT://<hostname>:9094
   inter.broker.listener.name=BROKER

4. listener.security.protocol.map=BROKER:SASL_PLAINTEXT,CONTROLLER:SASL_PLAINTEXT,SASL_PLAINTEXT:SASL_PLAINTEXT
   listeners=BROKER://<hostname>:9092
   inter.broker.listener.name=BROKER
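To see what clients actually observe between steps, a standalone metadata dump like the one below can be run against the cluster. This is only a minimal sketch: the bootstrap address, SASL mechanism, and credentials are placeholders and must be adapted to the environment described above.

```c
/* Sketch: print the broker list (advertised host:port per broker) that
 * librdkafka sees, so advertised listeners can be checked after every
 * upgrade step. Connection settings below are placeholders. */
#include <stdio.h>
#include <librdkafka/rdkafka.h>

int main(void) {
    char errstr[512];
    rd_kafka_conf_t *conf = rd_kafka_conf_new();

    /* Placeholder connection settings - adjust to your cluster. */
    rd_kafka_conf_set(conf, "bootstrap.servers", "khazad13:9092", errstr, sizeof(errstr));
    rd_kafka_conf_set(conf, "security.protocol", "sasl_plaintext", errstr, sizeof(errstr));
    rd_kafka_conf_set(conf, "sasl.mechanism", "PLAIN", errstr, sizeof(errstr));
    rd_kafka_conf_set(conf, "sasl.username", "<user>", errstr, sizeof(errstr));
    rd_kafka_conf_set(conf, "sasl.password", "<password>", errstr, sizeof(errstr));

    rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_PRODUCER, conf, errstr, sizeof(errstr));
    if (!rk) {
        fprintf(stderr, "failed to create client: %s\n", errstr);
        return 1;
    }

    const struct rd_kafka_metadata *md;
    rd_kafka_resp_err_t err = rd_kafka_metadata(rk, 0 /* no topics */, NULL, &md, 10000);
    if (err) {
        fprintf(stderr, "metadata request failed: %s\n", rd_kafka_err2str(err));
    } else {
        for (int i = 0; i < md->broker_cnt; i++)
            printf("broker %d advertises %s:%d\n",
                   (int)md->brokers[i].id, md->brokers[i].host, md->brokers[i].port);
        rd_kafka_metadata_destroy(md);
    }

    rd_kafka_destroy(rk);
    return 0;
}
```

Running this before and after each broker restart would show whether the advertised listeners reported in metadata already match the new configuration, or whether stale addresses are still being returned.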
After step 4, when a broker is restarted, clients start reporting connection errors such as:
BrokerTransportFailure (Local: Broker transport failure): sasl_plaintext://khazad13:9092/167843919: Connection setup timed out in state CONNECT (after 30029ms in state CONNECT, 1 identical error(s) suppressed)
Connect to ipv4#[10.1.24.76:9094] failed: Connection refused (after 0ms in state CONNECT, 6 identical error(s) suppressed)
The issue appears to be that the listener configuration has changed, but the Kafka client is still trying to connect to the previous address (as shown in the example, it keeps attempting khazad13 on port 9094, a listener that no longer exists after step 4).
The errors persist until the client process itself is restarted, after which everything works fine.
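For older librdkafka versions, one possible mitigation, sketched below under assumptions rather than as a confirmed fix, is to watch for persistent transport failures via the error callback and have the application recreate its client handle instead of restarting the whole process. The failure threshold and the recreate flag are application-level assumptions, not librdkafka features.

```c
/* Sketch: count persistent broker transport failures reported by librdkafka
 * and signal the application's main loop to tear down and recreate the
 * rd_kafka_t handle. Threshold and flag names are assumptions. */
#include <stdio.h>
#include <stdbool.h>
#include <stdatomic.h>
#include <librdkafka/rdkafka.h>

#define FAILURE_THRESHOLD 10   /* arbitrary; tune to your error rate */

static atomic_int  transport_failures;
static atomic_bool recreate_client;

static void error_cb(rd_kafka_t *rk, int err, const char *reason, void *opaque) {
    (void)rk; (void)opaque;
    if (err == RD_KAFKA_RESP_ERR__TRANSPORT ||
        err == RD_KAFKA_RESP_ERR__ALL_BROKERS_DOWN) {
        fprintf(stderr, "broker transport error: %s\n", reason);
        /* After repeated failures, ask the application to rebuild the client. */
        if (atomic_fetch_add(&transport_failures, 1) + 1 >= FAILURE_THRESHOLD)
            atomic_store(&recreate_client, true);
    }
}

/* When building the configuration:
 *     rd_kafka_conf_set_error_cb(conf, error_cb);
 * The main loop would check recreate_client, call rd_kafka_destroy() on the
 * old handle and create a fresh one with the current bootstrap servers. */
```

This effectively automates the "restart the client" workaround at the level of the handle rather than the process; whether it is still needed on 2.10.0 and later is part of the question below.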
Observations
- The issue occurs with older versions of librdkafka (e.g. 2.8.0).
- With librdkafka 2.10.0, the issue still happens during the upgrade, but after all brokers are restarted, the clients recover without requiring a restart.
- According to the changelog, there are fixes related to broker identification and the removal of unavailable brokers (#4557 "Purge brokers no longer reported in metadata", #4970 "Code and tests fixes to make the full test suite pass"). Is this behavior expected, and has the issue been fully resolved in 2.10.0 or later?
Questions
- Is this client-side connection failure expected during a rolling upgrade with listener changes?
- Is a client restart the only workaround for older librdkafka versions?
- Is this issue considered resolved in recent librdkafka versions, or are there recommended best practices for Kafka upgrades involving listener changes?
Thanks in advance for your help!