New Merge Join planner #1934
Conversation
…eady checked they can only match at most one column.
I think the concept is sound; it will need a lot more tests, but you should feel free to push ahead to make this PR-ready. Correctness is the main concern for merge join, and there are a lot of special cases: filter nullability, left joins, empty tables, etc. I'll need to walk through the logic more closely to understand whether there is a more efficient way to find optimal index matches. Ideally we'd take advantage of bitsets to make the code easy to read and also fast. We spend so much time (at database runtime, but also just while developing) doing filter/column comparisons that it's worth investing in conciseness here.
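As a rough illustration of the bitset idea, here is a minimal sketch. `ColSet` here is a hypothetical stand-in (the real memo package type is richer); the point is that "are all of this index's columns covered by equality filters plus constants" becomes a single mask operation:

```go
package main

import "fmt"

// ColSet is a minimal bitset over column ids (a sketch, not the
// PR's actual type).
type ColSet uint64

func (s ColSet) Add(col uint) ColSet    { return s | (1 << col) }
func (s ColSet) Contains(col uint) bool { return s&(1<<col) != 0 }
func (s ColSet) SubsetOf(t ColSet) bool { return s&^t == 0 }

func main() {
	// Columns referenced by equality filters plus known constants.
	var eqAndConst ColSet
	eqAndConst = eqAndConst.Add(1).Add(3).Add(4)

	// Columns of a candidate index prefix.
	var indexCols ColSet
	indexCols = indexCols.Add(1).Add(3)

	// The index is usable when its prefix columns are all covered.
	fmt.Println(indexCols.SubsetOf(eqAndConst)) // true
}
```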
jb.Right = d.Child
}
for _, lIndex := range lIndexes {
	matchedEqFilters := matchedFiltersForLeftIndex(lIndex, join.Left.RelProps.FuncDeps().Constants(), eqFilters)
functional dependencies are usually good at this -- the set of equality columns plus constants gives you an input ColSet. We've already built a ColSet for each index. We could use a combination of ColSet intersection and ordinal subtraction on the index ColSet to find all of the potential matching indexes (CRDB uses the phrase "InterestingOrderings" for the outcome of this process).
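A rough sketch of that process, with hypothetical names: given the set of covered columns (equality filters plus constants), walk each index's column list in order and keep the longest covered prefix. Indexes with a longer matched prefix are the more "interesting" orderings:

```go
package main

import "fmt"

// matchedPrefix returns how many leading index columns are covered by
// the given set of equality/constant columns. This is a sketch of the
// "interesting orderings" idea; the names here are hypothetical, not
// the PR's actual API.
func matchedPrefix(indexCols []uint, covered map[uint]bool) int {
	n := 0
	for _, c := range indexCols {
		if !covered[c] {
			break // prefix match only: stop at the first uncovered column
		}
		n++
	}
	return n
}

func main() {
	covered := map[uint]bool{1: true, 3: true, 4: true}
	// Index on (3, 1, 7): the first two columns are covered, the third is not.
	fmt.Println(matchedPrefix([]uint{3, 1, 7}, covered)) // 2
}
```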
This will likely eventually be superseded by using functional dependencies to recognize when a previous filter forces a subsequent filter to be constant.
Force-pushed from 167b59e to 194376f
…s where there are additional filters.
…erated values. The previous implementation had an issue where it assumed the type used in the received IndexLookup, but this type can actually depend on exactly how the lookup was generated (and whether the bounds value was parsed from the query or generated internally). This makes no such assumptions and adds extra tests.
Force-pushed from 03ebdec to 65657ad
Okay, this is finally out of draft. The only changes since @max-hoffman last looked at it are the new coster and the updated tests. There's one new test suite, named "merge join large and small table". There are a lot of changes to brittle tests, but having inspected them I believe these are all acceptable changes. The modified tests fall into a couple of categories:
Mostly LGTM; there is a lot going on here. More correctness tests would be good, and double-checking that this solves the customer's perf problem would be helpful. I'd anticipate merge join issues requiring some updating/compacting of this code over time. Improvements to the memo data structures will probably also let us simplify this over time.
…e join is a comparison between two columns.
…ver into nicktobey/mergejoin
…d `isInjectiveMerge`
…in building merge joins, handle the case where one of the key expressions is also determined to be constant.
This should fix dolthub/dolt#6020 once finished, and then some.
The killer feature in this new join planner is "Multi-Column Merge Joins": merges where the comparison incorporates multiple filter conditions. This allows us, in some cases, to choose a much more selective index for merge joins. This improves both memory usage and performance, because there will be fewer cases where the join iterator needs to keep multiple secondary rows in memory and cross-join them with multiple primary rows.
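To illustrate the idea (this is a sketch, not the PR's implementation, and real join keys are SQL values rather than ints): a multi-column merge join orders both sides on a composite key and compares it column by column, so ties on the first column are broken by later columns instead of buffering and cross-joining all tied rows:

```go
package main

import "fmt"

// compareKeys compares two composite join keys column by column,
// returning -1, 0, or 1. With a multi-column key, fewer rows compare
// as equal, so the merge iterator buffers fewer duplicates.
func compareKeys(a, b []int) int {
	for i := range a {
		if a[i] != b[i] {
			if a[i] < b[i] {
				return -1
			}
			return 1
		}
	}
	return 0
}

func main() {
	fmt.Println(compareKeys([]int{1, 2}, []int{1, 3})) // -1: tie broken by second column
	fmt.Println(compareKeys([]int{1, 2}, []int{1, 2})) // 0: full composite match
}
```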
The algorithm goes like this:
I added a test in join_planning_tests that demonstrates the potential of this new algorithm, allowing us to select a better index than was otherwise allowed. However, there is still some remaining work before this can be merged.
Costing
Currently the choice of index and comparison expression does not influence costing: all merge join plans on two given tables will have the same cost. This needs to be fixed to ensure we actually use these improved indexes once we find them.
The costing function for MergeJoins also seems suspiciously low: we may currently be favoring MergeJoins even when another join type might be preferable. Attempts to improve the cost function should take this into account.
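One direction for the fix, sketched below with entirely hypothetical numbers and names (the actual coster and its constants are not taken from this PR): make the cost depend on how many index columns the merge comparison matches, so a more selective index shrinks the expected number of duplicate secondary rows buffered per primary row:

```go
package main

import "fmt"

// mergeJoinCost is a hypothetical cost sketch. The per-duplicate term
// shrinks as more index columns are matched; the 10x reduction per
// extra matched column is an invented placeholder, not a measured value.
func mergeJoinCost(leftRows, rightRows float64, matchedCols int) float64 {
	dupFactor := 1.0
	for i := 1; i < matchedCols; i++ {
		dupFactor /= 10 // assume each extra matched column cuts duplicates 10x
	}
	// Scan both sides once, plus a penalty for buffered duplicate matches.
	return leftRows + rightRows + leftRows*rightRows*0.001*dupFactor
}

func main() {
	// A two-column match should cost less than a one-column match
	// on the same pair of tables.
	fmt.Println(mergeJoinCost(1000, 1000, 2) < mergeJoinCost(1000, 1000, 1)) // true
}
```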
Correctness
Looking at the changed tests, there are three main ways that plan tests were affected:
Tests that previously generated MergeJoins but don't anymore
Since this PR doesn't touch costing, tests that no longer produce MergeJoins are a red flag, because it means that there may be some plans that the old implementation generates that the new one doesn't. These should be investigated before merging.
Tests that now generate MergeJoins but didn't before
Since this PR doesn't touch costing, tests that produce MergeJoins but didn't before imply that these plans weren't being generated previously. Either the old planner was missing valid plans, or the new plans are incorrect, indicating a bug in the new planner. These should be inspected to make sure each new plan is correct.
Tests that changed from one MergeJoin plan to another, equally useful MergeJoin plan
This was likely a consequence of the new algorithm generating the same set of plans but in a different order. Given that we currently cost all these plans equally, it's not surprising that small changes in the planner will cause churn here. These are fine.