Skip to content

Conversation

@gaozhangmin
Copy link
Contributor

According to pulsar's fix: apache/pulsar#20642
race conditions problem is also exists in bk's ZKRegistrationClient.java

@StevenLuMT
Copy link
Member

Can you provide some information about the process of locating the problem you encountered, or is this just a simple logic synchronization with Pulsar?

@gaozhangmin
Copy link
Contributor Author

@StevenLuMT
After switching the bookie from readonly to writable status online, the broker side continuously generated error logs: "Bookie Handle not available." The issue was resolved only after restarting the bookie. It was discovered that there was a bug in the ZkRegisterClient relied upon by the Pulsar broker, which caused the broker's bookie status cache update to be lost. Although the bookie was actually in a normal state, the broker failed to resolve the bookie's hostname due to the missing cache, resulting in the error: "Bookie Handle not available."

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR addresses a race condition in BookKeeper's ZKRegistrationClient where bookie information could be lost due to notification timing issues. The fix separates the cache for writable and read-only bookies and ensures sequential processing of ZooKeeper events to prevent race conditions.

  • Splits the single bookieServiceInfoCache into separate caches for writable and read-only bookies
  • Introduces sequential processing of ZooKeeper notifications using a Sequencer utility class
  • Adds comprehensive test coverage to verify the fix handles network delays during bookie transitions

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
ZKRegistrationClient.java Core fix implementing separate caches and sequential event processing
FutureUtils.java Adds Sequencer utility class for sequential task execution
ZKRegistrationClientTest.java New test to verify race condition fix with network delay simulation
FaultInjectableZKRegistrationManager.java Test utility for fault injection during registration operations

@gaozhangmin
Copy link
Contributor Author

rerun failure checks

1 similar comment
@gaozhangmin
Copy link
Contributor Author

rerun failure checks

@wenbingshen
Copy link
Member

@gaozhangmin Is it different from this fix #4481

@gaozhangmin
Copy link
Contributor Author

@gaozhangmin Is it different from this fix #4481

it's different case

@gaozhangmin gaozhangmin force-pushed the fix_ZKRegistrationClient_1 branch from 27b9b85 to 634bf4f Compare August 13, 2025 05:23
@gaozhangmin gaozhangmin force-pushed the fix_ZKRegistrationClient_1 branch from 634bf4f to 40e60d5 Compare August 13, 2025 05:25
@gaozhangmin
Copy link
Contributor Author

rerun failure checks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants