
Conversation

emasab
Contributor

@emasab emasab commented Feb 18, 2025

Includes a task to run the test suite on demand on Semaphore CI.

A description can be found in each commit message.

Closes #4964, #4778, #4907, #4884

@confluent-cla-assistant

🎉 All Contributor License Agreements have been signed. Ready to merge.
Please push an empty commit if you would like to re-run the checks to verify CLA status for all contributors.

@airlock-confluentinc airlock-confluentinc bot force-pushed the dev_run_all_tests_no_flakyness branch 2 times, most recently from 0ef423a to b02a4eb Compare February 18, 2025 18:48
emasab added a commit to mfleming/librdkafka that referenced this pull request Feb 18, 2025
@airlock-confluentinc airlock-confluentinc bot force-pushed the dev_run_all_tests_no_flakyness branch from b02a4eb to 8a7a17a Compare February 18, 2025 19:32
emasab added a commit to mfleming/librdkafka that referenced this pull request Feb 18, 2025
@airlock-confluentinc airlock-confluentinc bot force-pushed the dev_run_all_tests_no_flakyness branch 2 times, most recently from 30f2a9c to 41c2d23 Compare February 22, 2025 13:29
emasab added a commit to mfleming/librdkafka that referenced this pull request Feb 22, 2025
emasab added a commit to mfleming/librdkafka that referenced this pull request Feb 22, 2025
@airlock-confluentinc airlock-confluentinc bot force-pushed the dev_run_all_tests_no_flakyness branch from 41c2d23 to f67e0d4 Compare February 26, 2025 08:33
emasab added a commit to mfleming/librdkafka that referenced this pull request Feb 26, 2025
@airlock-confluentinc airlock-confluentinc bot force-pushed the dev_run_all_tests_no_flakyness branch 2 times, most recently from 10e3b71 to 5609960 Compare February 26, 2025 20:17
Contributor

@milindl milindl left a comment

Leaving partial review comments.

  rkt = test_create_producer_topic(rk, topic, "message.timeout.ms",
-                                  "2000", NULL);
+                                  "3000", NULL);

TEST_SAY("Auto-creating topic %s\n", topic);
test_auto_create_topic_rkt(rk, rkt, tmout_multip(5000));
Contributor

This wouldn't be needed anymore, as we're creating the topic using the mock broker

        case 1:
                expected_metadata_requests = 7;
                break;
        default:
Contributor

nit: Just fail the test if the case isn't one of the expected variations instead; otherwise a person reading the code will think the default is a valid case too.
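The nit above can be sketched as a standalone example (the helper name and the variation-to-value mapping for case 0 are illustrative, not the PR's actual test code): the default branch aborts instead of being treated as another valid variation.

```c
#include <stdio.h>
#include <stdlib.h>

/* Illustrative sketch: map a test variation to its expected number of
 * metadata requests, and fail hard on any variation the test does not
 * know, so a reader never mistakes the default branch for a valid case. */
static int expected_metadata_requests(int variation) {
        switch (variation) {
        case 0:
                return 5; /* illustrative value */
        case 1:
                return 7; /* value shown in the snippet above */
        default:
                fprintf(stderr, "Unexpected variation %d\n", variation);
                abort(); /* fail fast: no silent fallback */
        }
}
```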

@airlock-confluentinc airlock-confluentinc bot force-pushed the dev_run_all_tests_no_flakyness branch 4 times, most recently from b15851b to 85daea2 Compare March 3, 2025 11:48
Contributor

@milindl milindl left a comment

Another partial review; 2 commits are still pending review.

@@ -116,7 +116,12 @@ int main_0093_holb_consumer(int argc, char **argv) {

test_conf_set(conf, "session.timeout.ms", "6000");
test_conf_set(conf, "max.poll.interval.ms", "20000");
test_conf_set(conf, "socket.timeout.ms", "3000");
/* Socket timeout must be greater than
Contributor

For the requests sent to the group coordinator (the JoinGroup request), don't we set the request timeout differently?

        /* Absolute timeout */
        rd_kafka_buf_set_abs_timeout_force(
            rkbuf,
            /* Request timeout is max.poll.interval.ms + grace
             * if the broker supports it, else
             * session.timeout.ms + grace. */
            (ApiVersion >= 1 ? rk->rk_conf.max_poll_interval_ms
                             : rk->rk_conf.group_session_timeout_ms) +
                3000 /* 3s grace period*/,
            0);

In particular, isn't this test designed to verify that a JoinGroup request may block the group coordinator connection for a time T, where socket.timeout.ms < T < max.poll.interval.ms, without causing any failures?

From commit that added this test,
"Since JoinGroupRequests may block for up to max.poll.interval.ms,
which may be set very high (hours..), any sub-sequent requests
on the same connection, such as Metadata refreshes, would time out
and tear down the connection, triggering another rebalance."
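The quoted timeout rule reduces to a single expression; this standalone sketch (function name and parameter names are mine) just restates which bound applies per ApiVersion:

```c
#include <stdint.h>

/* Sketch of the quoted rule: the JoinGroup request timeout is
 * max.poll.interval.ms + grace when the broker supports ApiVersion >= 1,
 * else session.timeout.ms + grace. Names here are illustrative only. */
static int32_t joingroup_abs_timeout_ms(int16_t api_version,
                                        int32_t max_poll_interval_ms,
                                        int32_t session_timeout_ms) {
        const int32_t grace_ms = 3000; /* 3s grace period, as in the snippet */
        return (api_version >= 1 ? max_poll_interval_ms
                                 : session_timeout_ms) +
               grace_ms;
}
```

With the values from the 0093 test above (max.poll.interval.ms=20000, session.timeout.ms=6000), both bounds comfortably exceed socket.timeout.ms=3000.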

Contributor Author

@emasab emasab Mar 24, 2025

I cannot reproduce this test failure anymore, so I'll remove the test change. It's fair that socket.timeout.ms shouldn't be used here and max.poll.interval.ms should be used instead. This value is still lower, but there should be no need to increase it.

@@ -87,6 +87,9 @@ static void do_test_fetch_max_bytes(void) {
Test::Fail("Failed to create KafkaConsumer: " + errstr);
delete conf;

/* For next consumer */
test_wait_topic_exists(c->c_ptr(), topic.c_str(), 5000);
Contributor

question: given that we've already produced to said topic, this is pretty much just rd_sleep(1) to let the topic propagate, right?

Contributor Author

@emasab emasab Mar 24, 2025

It's similar at the moment; we may later change how we wait for a topic to exist on all brokers by using JMX requests.

Comment on lines +5154 to 5173
static int32_t
rd_kafka_cgrp_subscription_set(rd_kafka_cgrp_t *rkcg,
Contributor

nit: Add a @brief description above this method; it's a bit easier for the reader if it also documents what we're returning.

rd_kafka_cgrp_subscription_set(rd_kafka_cgrp_t *rkcg,
rd_kafka_topic_partition_list_t *rktparlist) {
int32_t ret = rd_atomic32_add(&rkcg->rkcg_subscription_version, 1);
Contributor

nit

Suggested change
-        int32_t ret = rd_atomic32_add(&rkcg->rkcg_subscription_version, 1);
+        int32_t new_subscription_version = rd_atomic32_add(&rkcg->rkcg_subscription_version, 1);
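For context only, here is a standalone illustration of why the suggested name reads better: the increment returns the new version, which `ret` doesn't convey. It uses C11 `stdatomic.h` rather than librdkafka's `rd_atomic32_add` (which returns the new value directly).

```c
#include <stdatomic.h>
#include <stdint.h>

/* Standalone illustration: bump a subscription version and return the
 * new value. C11 atomic_fetch_add returns the previous value, hence the
 * explicit + 1; librdkafka's rd_atomic32_add returns the new value. */
static int32_t subscription_version_bump(atomic_int *version) {
        int32_t new_subscription_version = atomic_fetch_add(version, 1) + 1;
        return new_subscription_version;
}
```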

rd_kafka_metadata_internal_t
    *rk_full_metadata;           /* Last full metadata. */
rd_ts_t rk_ts_full_metadata;     /* Timestamp of .. */
Contributor

nit

/* Timestamp of most recent full metadata */

* avoid retrying it on this same broker.
* This is to prevent the client from hanging
* until it can connect to this broker again. */
if (!request->rkbuf_u.Metadata.decr &&
Contributor

Wouldn't we need to hold `Metadata.decr_lock` while reading it?

Contributor Author

@emasab emasab Mar 24, 2025

No, it's only needed if we need to decrease the integer it points to, which can be rkmc_full_brokers_sent or rkmc_full_topics_sent. The lock is acquired when the buffer is destroyed. I can add a comment for this.
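A minimal standalone sketch of the pattern being discussed (the struct and names are hypothetical, not librdkafka's actual types): the `decr` pointer is set once before the request is shared and only read afterwards, so the NULL check needs no lock, while the shared counter it points to is decremented under the lock.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical sketch: 'decr' is written once at request setup and only
 * read afterwards, so checking it needs no lock; the counter it points
 * to is shared between requests, so decrementing it takes the lock. */
struct metadata_request {
        int *decr;                  /* NULL, or shared counter to decrement */
        pthread_mutex_t *decr_lock; /* protects *decr, not the pointer */
};

static void metadata_request_destroy(struct metadata_request *req) {
        if (!req->decr) /* lock-free read of the write-once pointer */
                return;
        pthread_mutex_lock(req->decr_lock);
        (*req->decr)--; /* shared write: lock required */
        pthread_mutex_unlock(req->decr_lock);
}
```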

emasab added 19 commits March 27, 2025 11:28
skip events generated before the assignment
that lead to a test failure
- avoid full metadata refresh during metadata propagation time after topic creation
- Rebalance events order after max.poll.interval.ms exceeded
…ssages verification. Log warnings for the errors to identify the cause.
calls cause an unknown topic or partition error
as for the `rd_kafka_toppar_t` that is also
the same strategy as in Java client, considering:

- when topic id changes partitions metadata is taken
  entirely from the new one.

- when leader epoch is greater or equal to the
  one in the cache, or null (-1),
  partition metadata is taken from the new one.

- when leader epoch is less than the one in the
  cache, partition metadata remains the same.

Also when full metadata is necessary, the cache
is used for storing it and for matching topics
in the regex, removing the need to store
the full metadata result, that is about the same
size, but the cache is updated more accurately.
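As I read them, the three rules above condense into one decision; this hedged sketch (the helper name and signature are mine, and treating -1 as the null epoch on the new side is my reading of the commit text) mirrors the described Java-client strategy:

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the update rules described above: take the new partition
 * metadata when the topic id changed, or when the new leader epoch is
 * null (-1) or >= the cached one; keep the cached partition metadata
 * only when the new epoch is strictly older. */
static bool take_new_partition_metadata(bool topic_id_changed,
                                        int32_t cached_leader_epoch,
                                        int32_t new_leader_epoch) {
        if (topic_id_changed)
                return true; /* new topic id: take everything from new */
        if (new_leader_epoch == -1 ||
            new_leader_epoch >= cached_leader_epoch)
                return true;
        return false; /* stale epoch: keep cached metadata */
}
```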
by reducing or skipping tests with a large number of elements
on an unreachable broker prevents refreshing the controller or the coordinator until that broker becomes reachable again
It's due to the metadata propagation period:
even after producing to the topic is done, the metadata may not yet be propagated to all brokers.
As in `test_wait_metadata_update`  we wait 1s for
propagation.
TBD: checking the JMX metrics about metadata propagation to tell exactly when metadata has been propagated.
This issue happens when a broker is being destroyed and there are enqueued operations to add buffers to it.
If those operations are executed after the buffers are purged, they aren't destroyed, and the refcnt for `rkbuf_rkb` prevents the broker and the whole instance from being finally destroyed
on `rd_kafka_toppar_delegate_to_leader`
produces a _TIMED_OUT error and then, after retrying, the delivery callbacks are called with _MSG_TIMED_OUT.
for them, as this way it won't be requested later and
will skip metadata refresh with a log containing
"already being requested".
Related issue is solved with subscription versions and tests 0143 and 0146 still pass

Replace wait cache hints in case consumer
group metadata refresh wasn't sent to request them again.

Avoid joining the group if not all topics are in cache but metadata request couldn't be sent
version to set the released one
@airlock-confluentinc airlock-confluentinc bot force-pushed the dev_run_all_tests_no_flakyness branch from d1d53e2 to e07ae46 Compare March 27, 2025 10:32
Contributor

@milindl milindl left a comment

Reviewed individual commits and fixups. Please merge with all commits intact. Thank you for the fixes!

emasab added 3 commits March 27, 2025 19:08
failing after the produce requests with a _STATE error.
both variations to track first metadata request as well
@airlock-confluentinc airlock-confluentinc bot force-pushed the dev_run_all_tests_no_flakyness branch from eefdbb9 to c110479 Compare March 27, 2025 18:11
@emasab emasab requested a review from milindl March 27, 2025 18:13
Contributor

@milindl milindl left a comment

I reviewed the newly added commits.

@emasab emasab merged commit 6378837 into master Mar 28, 2025
2 checks passed
@emasab emasab deleted the dev_run_all_tests_no_flakyness branch March 28, 2025 08:25
Development

Successfully merging this pull request may close these issues.

Metadata cache corruption / crash when updating with same topic, different topic ID