Client connection failures during Kafka rolling upgrade when broker listener configuration changes #5176

@DDDFiish

Description

During a rolling upgrade of a multi-node Kafka cluster, we change the broker listener configuration in several steps, restarting the brokers one by one after each change. After the final listener configuration is applied and the brokers are restarted, Kafka clients using librdkafka experience connection failures until the client process is restarted.

Environment

  • Kafka version: 3.9.0
  • librdkafka version: below 2.10.0

Upgrade Steps

We apply the following configuration changes step by step, restarting brokers after each change:

  1. listener.security.protocol.map=BROKER:SASL_PLAINTEXT,CONTROLLER:SASL_PLAINTEXT,SASL_PLAINTEXT:SASL_PLAINTEXT
    listeners=SASL_PLAINTEXT://<hostname>:9092,BROKER://<hostname>:9094
    inter.broker.listener.name=SASL_PLAINTEXT
    
  2. listener.security.protocol.map=BROKER:SASL_PLAINTEXT,CONTROLLER:SASL_PLAINTEXT,SASL_PLAINTEXT:SASL_PLAINTEXT
    listeners=SASL_PLAINTEXT://<hostname>:9092,BROKER://<hostname>:9094
    inter.broker.listener.name=BROKER
    
  3. listener.security.protocol.map=BROKER:SASL_PLAINTEXT,CONTROLLER:SASL_PLAINTEXT,SASL_PLAINTEXT:SASL_PLAINTEXT
    listeners=BROKER://<hostname>:9092,SASL_PLAINTEXT://<hostname>:9094
    inter.broker.listener.name=BROKER
    
  4. listener.security.protocol.map=BROKER:SASL_PLAINTEXT,CONTROLLER:SASL_PLAINTEXT,SASL_PLAINTEXT:SASL_PLAINTEXT
    listeners=BROKER://<hostname>:9092
    inter.broker.listener.name=BROKER
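
For clarity, here is a small Python sketch (not part of the upgrade itself) that parses the `listeners` value from each step and shows which listener name is served on which port; `<hostname>` is the same placeholder as in the steps above:

```python
# Parse a Kafka `listeners` string into {listener name: port} and
# print the mapping for each upgrade step. Purely illustrative.

def parse_listeners(listeners: str) -> dict[str, int]:
    """Map listener name -> port from a Kafka `listeners` string."""
    result = {}
    for entry in listeners.split(","):
        name, address = entry.split("://", 1)
        result[name] = int(address.rsplit(":", 1)[1])
    return result

steps = [
    "SASL_PLAINTEXT://<hostname>:9092,BROKER://<hostname>:9094",  # step 1
    "SASL_PLAINTEXT://<hostname>:9092,BROKER://<hostname>:9094",  # step 2
    "BROKER://<hostname>:9092,SASL_PLAINTEXT://<hostname>:9094",  # step 3
    "BROKER://<hostname>:9092",                                   # step 4
]

for i, listeners in enumerate(steps, start=1):
    print(f"step {i}: {parse_listeners(listeners)}")
```

Note that after step 4 no listener is served on port 9094 at all, which is the port the failing clients keep trying to reach.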
    

After step 4, when a broker is restarted, clients start reporting connection errors such as:

BrokerTransportFailure (Local: Broker transport failure): sasl_plaintext://khazad13:9092/167843919: Connection setup timed out in state CONNECT (after 30029ms in state CONNECT, 1 identical error(s) suppressed)
Connect to ipv4#[10.1.24.76:9094] failed: Connection refused (after 0ms in state CONNECT, 6 identical error(s) suppressed)

The issue appears to be that although the listener configuration has changed, the client keeps reconnecting to the previously advertised endpoint (in the example above, khazad13 on port 9094, which is no longer served after step 4).
The errors persist until the client process itself is restarted, after which everything works fine.
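
This failure mode is consistent with the client holding a per-broker cached endpoint and reconnecting to it until fresh metadata replaces it. A minimal Python model of that caching behavior (the broker id and addresses are taken from the logs above; the cache logic is an illustration, not librdkafka's actual implementation):

```python
# Simplified model of a client-side broker metadata cache. The broker
# id and addresses come from the error logs in this issue; the cache
# itself is a sketch, not librdkafka internals.

class MetadataCache:
    def __init__(self) -> None:
        self._endpoints: dict[int, str] = {}

    def update(self, metadata: dict[int, str]) -> None:
        """Replace cached endpoints with a fresh metadata response."""
        self._endpoints = dict(metadata)

    def endpoint(self, broker_id: int) -> str:
        return self._endpoints[broker_id]

cache = MetadataCache()
# Metadata fetched before the upgrade still advertises port 9094.
cache.update({167843919: "10.1.24.76:9094"})
print(cache.endpoint(167843919))  # stale address -> "Connection refused"

# Reconnect attempts reuse the stale address until a metadata refresh
# (or a client restart) picks up the new listener on port 9092.
cache.update({167843919: "10.1.24.76:9092"})
print(cache.endpoint(167843919))
```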

Questions

  • Is this client-side connection failure expected during rolling upgrade with listener changes?
  • Is a client restart the only workaround for older librdkafka versions?
  • Is this issue considered resolved in recent librdkafka versions, or are there recommended best practices for Kafka upgrades involving listener changes?
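
For reference, these are the librdkafka client properties that control metadata refresh and reconnect cadence; the values shown are illustrative, and we have not verified that tuning them avoids the client restart on the affected versions:

```properties
# librdkafka client properties relevant to stale broker metadata.
# Values are examples, not recommendations.
topic.metadata.refresh.interval.ms=60000   # periodic metadata refresh
metadata.max.age.ms=180000                 # expire cached metadata older than this
reconnect.backoff.ms=100                   # initial reconnect backoff
reconnect.backoff.max.ms=10000             # cap on reconnect backoff
```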

Thanks in advance for your help!
