Fix non-deterministic hash-distributed exchange routing in multi-stage query engine #17323
Conversation
Codecov Report: ✅ All modified and coverable lines are covered by tests.

@@             Coverage Diff              @@
##              master   #17323     +/-   ##
=============================================
- Coverage      63.26%   63.25%   -0.01%
- Complexity      1433     1474      +41
=============================================
  Files           3133     3135       +2
  Lines         186168   186477     +309
  Branches       28416    28495      +79
=============================================
+ Hits          117774   117958     +184
- Misses         59326    59413      +87
- Partials        9068     9106      +38
Backward compatibility regression tests are failing with an interesting error; I'll need to take a closer look. Also weirdly enough, it's failing against

Edit: the issue is that we're sorting at execution time instead of plan time, so different sender stages could send partitions to different workers in the same receiver stage when the servers are on different versions. Moving the sort to plan time in the broker should fix the issue.
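For illustration, a minimal sketch (hypothetical names, not Pinot's actual classes) of why every sender stage must agree on one receiver-worker ordering fixed at plan time: hash routing is modulo arithmetic over an ordered worker list, so two senders that order that list differently will disagree on where the same key lands.

// Hypothetical, simplified sketch: each sender routes a record to a receiver worker via
// modulo over an ordered worker list, so the list order must be identical across all
// senders that feed the same receiver stage.
import java.util.List;

public class ExchangeOrderingSketch {
  // Picks the receiver worker for a record, given an ordered receiver-worker list.
  static int route(int keyHash, List<Integer> orderedReceiverWorkers) {
    return orderedReceiverWorkers.get(Math.floorMod(keyHash, orderedReceiverWorkers.size()));
  }

  public static void main(String[] args) {
    int keyHash = "customer-42".hashCode();
    // Two senders ordering the same three workers differently may pick different targets.
    System.out.println(route(keyHash, List.of(0, 1, 2)));
    System.out.println(route(keyHash, List.of(2, 0, 1))); // same key, possibly a different worker
  }
}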
Force-pushed from 1b61486 to b8f8073.
gortiz left a comment
Can we add an integration test with an example?
// receiving worker.
// Without this sorting, different stages could route the same hash value to different workers, resulting in
// incorrect join/union/intersect results for pre-partitioned sends (where a full partition shuffle is skipped).
mailboxInfoList.sort(Comparator.comparingInt(info -> info.getWorkerIds().get(0)));
Are we sure the worker IDs list is going to have exactly one element? Shouldn't we sort by all elements in case the first elements are equal?
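For reference, a hedged sketch of what "sort by all elements" could look like: a standalone lexicographic comparator over integer lists (everything here is illustrative except the getWorkerIds() accessor, which appears in the diff above).

// Hypothetical comparator comparing full worker-id lists instead of only the first element.
import java.util.Comparator;
import java.util.List;

public class WorkerIdsOrderingSketch {
  // Compares two worker-id lists element by element; a shorter list wins a tie on the common prefix.
  static final Comparator<List<Integer>> BY_ALL_WORKER_IDS = (a, b) -> {
    for (int i = 0; i < Math.min(a.size(), b.size()); i++) {
      int cmp = Integer.compare(a.get(i), b.get(i));
      if (cmp != 0) {
        return cmp;
      }
    }
    return Integer.compare(a.size(), b.size());
  };

  public static void main(String[] args) {
    // [0, 2] sorts before [0, 3]: the tie on the first element is broken by the second.
    System.out.println(BY_ALL_WORKER_IDS.compare(List.of(0, 2), List.of(0, 3)) < 0); // true
  }
}

If applied, the sort in the diff would presumably become something like mailboxInfoList.sort(Comparator.comparing(MailboxInfo::getWorkerIds, BY_ALL_WORKER_IDS)) rather than comparing only getWorkerIds().get(0).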
It can be more than one if there's stage parallelism, but in that case the worker IDs per physical server should be contiguous IIUC -
pinot/pinot-query-planner/src/main/java/org/apache/pinot/query/routing/WorkerManager.java
Lines 268 to 271 in 0e702d0
for (QueryServerInstance serverInstance : candidateServers) {
  for (int i = 0; i < stageParallelism; i++) {
    workerIdToServerInstanceMap.put(workerId++, serverInstance);
  }
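To make the contiguity concrete, a minimal sketch of the assignment loop quoted above with made-up server names (the real code uses QueryServerInstance objects and more bookkeeping): with parallelism 2 over two servers, worker ids come out as {0, 1} on the first server and {2, 3} on the second.

// Simplified stand-in for the quoted loop: worker ids for the same server are contiguous.
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class WorkerAssignmentSketch {
  public static void main(String[] args) {
    List<String> candidateServers = List.of("Server_S1", "Server_S2"); // made-up names
    int stageParallelism = 2;
    Map<Integer, String> workerIdToServer = new LinkedHashMap<>();
    int workerId = 0;
    for (String server : candidateServers) {
      for (int i = 0; i < stageParallelism; i++) {
        workerIdToServer.put(workerId++, server);
      }
    }
    System.out.println(workerIdToServer); // {0=Server_S1, 1=Server_S1, 2=Server_S2, 3=Server_S2}
  }
}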
Lines 256 to 263 in 43142cd
// IMP: Return mailbox info sorted by workerIds. This is because SendingMailbox are created in this order, and
// record assignment for hash exchange follows modulo arithmetic. e.g. if we have sending mailbox in order:
// [worker-1, worker-0], then records with modulo 0 hash would end up in worker-1.
// Note that the workerIds list will be >1 in length only when there's a parallelism change. It's important to
// also know that MailboxSendOperator will iterate over this List<MailboxInfo> in order, and within each iteration
// iterate over all the workerIds of that MailboxInfo. The result List<SendingMailbox> is used for modulo
// arithmetic for any partitioning exchange strategy.
result.sort(Comparator.comparingInt(info -> info.getWorkerIds().get(0)));
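A simplified sketch, with stand-in types, of the flattening and modulo assignment that comment describes: the MailboxInfos are visited in sorted order, each one's workerIds are appended to a single list, and a record with hash h goes to index h % size, so reordering the MailboxInfos would route the same hash to a different worker.

// Stand-in types only; not Pinot's MailboxInfo/SendingMailbox classes.
import java.util.ArrayList;
import java.util.List;

public class ModuloAssignmentSketch {
  record FakeMailboxInfo(List<Integer> workerIds) {}

  public static void main(String[] args) {
    // Sorted by first worker id: one server hosts workers [0, 1], another hosts [2, 3].
    List<FakeMailboxInfo> sorted =
        List.of(new FakeMailboxInfo(List.of(0, 1)), new FakeMailboxInfo(List.of(2, 3)));
    List<Integer> sendingOrder = new ArrayList<>();
    for (FakeMailboxInfo info : sorted) {
      sendingOrder.addAll(info.workerIds()); // flatten in sorted MailboxInfo order
    }
    int hash = 6;
    System.out.println(sendingOrder.get(hash % sendingOrder.size())); // 2
    // If the two MailboxInfos were swapped, the same hash would be routed to worker 0 instead.
  }
}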
Jackie-Jiang left a comment
Good catch!
Let's also document the contract for this visitor as Javadoc for this class.
Trying to understand where the code breaks: without this fix, mailboxInfoList won't have a deterministic order of servers for each worker id. Where do we access it based on the index of the server?
// receiving worker.
// Without this sorting, different stages could route the same hash value to different workers, resulting in
// incorrect join/union/intersect results for pre-partitioned sends (where a full partition shuffle is skipped).
mailboxInfoList.sort(Comparator.comparingInt(info -> info.getWorkerIds().get(0)));
Is it easier to follow if we change serverToWorkerIdsMap to a LinkedHashMap? Essentially we want to preserve the worker id order in the generated server list.
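For what it's worth, a small illustration of that suggestion, assuming a hypothetical serverToWorkerIdsMap populated while walking worker ids in ascending order: a LinkedHashMap iterates in insertion order, while a plain HashMap iterates in an order driven by key hashing.

// Made-up server names; only demonstrates the iteration-order difference between map types.
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ServerOrderSketch {
  public static void main(String[] args) {
    // Worker 0 lives on "server-9", worker 1 on "server-1".
    Map<String, List<Integer>> hashMap = new HashMap<>();
    Map<String, List<Integer>> linkedMap = new LinkedHashMap<>();
    hashMap.put("server-9", List.of(0));
    hashMap.put("server-1", List.of(1));
    linkedMap.put("server-9", List.of(0));
    linkedMap.put("server-1", List.of(1));
    System.out.println(hashMap.keySet());   // order determined by key hashing, not by worker id
    System.out.println(linkedMap.keySet()); // [server-9, server-1] -- worker-id (insertion) order kept
  }
}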
It's eventually used here, through:
Lines 55 to 79 in 5384079
Lines 168 to 178 in 5384079
Hm no, I believe the issue is more so that there's no guarantee that, across different branches in the same query plan tree, we'll have the same partition <-> workerId mapping, because the worker / mailbox order depends on the iteration order of various maps (enabled server instances, routing table, etc.) in:
Lines 102 to 123 in 5384079

IMO, the cleanest solution is guaranteeing that the
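To illustrate why deriving order from map iteration is a shaky foundation, a Pinot-agnostic sketch (the instance names are made up): an ordering taken from HashMap iteration is an implementation detail that can shift with capacity, resize history, or JDK version, while an explicit sort gives every plan branch the same, reproducible order to do modulo routing against.

// Contrasts HashMap iteration order (no contract) with an explicit sort by worker id.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DeterministicOrderSketch {
  public static void main(String[] args) {
    Map<String, Integer> serverToWorkerId = new HashMap<>();
    serverToWorkerId.put("Server_192.168.0.3_8098", 0);
    serverToWorkerId.put("Server_192.168.0.1_8098", 1);
    serverToWorkerId.put("Server_192.168.0.2_8098", 2);

    // Order A: whatever the HashMap happens to iterate in -- not a contract.
    List<Map.Entry<String, Integer>> iterationOrder = new ArrayList<>(serverToWorkerId.entrySet());

    // Order B: explicitly sorted by worker id -- the same in every branch and on every broker/server.
    List<Map.Entry<String, Integer>> sortedOrder = new ArrayList<>(serverToWorkerId.entrySet());
    sortedOrder.sort(Map.Entry.comparingByValue());

    System.out.println(iterationOrder); // implementation-dependent order
    System.out.println(sortedOrder);    // always ordered by worker id: 0, 1, 2
  }
}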
Jackie-Jiang left a comment
Synced offline. LinkedHashMap and the current explicit sorting achieve the same goal: a deterministic order of servers based on worker order. Both of them work.
The root cause of this issue is the non-deterministic order from worker id to server, which should be fixed separately. Let's add a TODO for that.
Yup, that's right, thanks for clarifying -- I had misunderstood your earlier point.
HashMap iteration during worker assignment at various places in WorkerManager.