New Merge Join planner #1934


Merged
merged 22 commits into from
Sep 2, 2023
Conversation


@nicktobey nicktobey commented Aug 9, 2023

This should fix dolthub/dolt#6020 once finished, and then some.

The killer new feature in this new join planner is "Multi-Column Merge Joins": merges where the comparison incorporates multiple filter conditions. This allows us, in some cases, to choose a much more selective index for merge joins. This improves both memory usage and performance, because there will be fewer cases where the join iterator needs to keep multiple secondary rows in memory and cross-join them with multiple primary rows.
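Conceptually, the multi-column comparison reduces to a lexicographic comparison on key tuples. A minimal sketch (illustrative only; the real planner compares SQL values, not ints):

```go
package main

import "fmt"

// compareTuples lexicographically compares two key tuples, the way a
// multi-column merge join orders rows by a compound index key.
// Returns -1, 0, or 1. (Sketch; assumes equal-length tuples.)
func compareTuples(a, b []int) int {
	for i := range a {
		switch {
		case a[i] < b[i]:
			return -1
		case a[i] > b[i]:
			return 1
		}
	}
	return 0
}

func main() {
	fmt.Println(compareTuples([]int{1, 2}, []int{1, 3})) // -1
	fmt.Println(compareTuples([]int{2, 0}, []int{1, 9})) // 1
	fmt.Println(compareTuples([]int{1, 2}, []int{1, 2})) // 0
}
```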

The algorithm goes like this:

  • For each index on the left table:
    • Compute the max set of filter expressions that match that index
    • Check to see if any indexes on the right table match that same set of filters in the same order.
      • If so, use this set of filter expressions to generate a Merge Join plan. If there are multiple expressions, we combine them into a comparison on tuples.
    • Remove the last filter expression and check again, repeating until the "matched filters" list is empty.
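The matching step above can be sketched as follows. This is a simplified illustration, not the actual `matchedFiltersForLeftIndex` implementation: filters and indexes are reduced to column-name pairs, and one helper finds the longest index prefix covered by equality filters on both sides.

```go
package main

import "fmt"

// eqFilter is a simplified equality filter joining a left column to a
// right column. (Hypothetical representation for illustration.)
type eqFilter struct{ left, right string }

// matchedFilters returns the longest prefix of the two index column lists
// that is covered, position by position, by an equality filter. Shorter
// match lists correspond to the "remove the last filter and check again"
// step in the algorithm above.
func matchedFilters(leftIdx, rightIdx []string, filters []eqFilter) []eqFilter {
	var matched []eqFilter
	for i := 0; i < len(leftIdx) && i < len(rightIdx); i++ {
		found := false
		for _, f := range filters {
			if f.left == leftIdx[i] && f.right == rightIdx[i] {
				matched = append(matched, f)
				found = true
				break
			}
		}
		if !found {
			break // index prefix no longer covered; stop extending the match
		}
	}
	return matched
}

func main() {
	filters := []eqFilter{{"a", "x"}, {"b", "y"}}
	// Both index prefixes (a, b) and (x, y) are covered: two matched filters.
	fmt.Println(len(matchedFilters([]string{"a", "b", "c"}, []string{"x", "y"}, filters))) // 2
	// First positions don't line up with any filter: no match.
	fmt.Println(len(matchedFilters([]string{"b", "a"}, []string{"x", "y"}, filters))) // 0
}
```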

I added a test in join_planning_tests that demonstrates the potential of this new algorithm, allowing us to select a better index than would otherwise be allowed.

However, there is still some remaining work before this can be merged.

Costing

Currently the choice of index and comparison expression does not influence costing: all merge join plans on two given tables will have the same cost. This needs to be fixed to ensure we actually use these improved indexes once we find them.

The costing function for MergeJoins also seems suspiciously low: we may currently be favoring MergeJoins even when another join type might be preferable. Attempts to improve the cost function should take this into account.
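One hedged direction for the cost fix: make the estimate depend on the expected duplicates per key on each side, so that a plan over non-unique indexes costs more than one over unique indexes. The formula and names below are hypothetical, not the actual coster:

```go
package main

import "fmt"

// mergeJoinCost is a hypothetical cost model: a merge join scans both
// inputs once, plus work proportional to the cross products of
// duplicate-key runs. dupL and dupR are the average rows per key value
// on each side (1.0 when the index is provably unique).
func mergeJoinCost(rowsL, rowsR, dupL, dupR float64) float64 {
	scan := rowsL + rowsR
	// Approximate the number of key groups as rowsL/dupL; each group
	// cross-joins dupL left rows with dupR right rows.
	crossWork := (rowsL / dupL) * dupL * dupR
	return scan + crossWork
}

func main() {
	// Unique keys on both sides: cost stays near a single pass.
	fmt.Println(mergeJoinCost(1000, 1000, 1, 1)) // 3000
	// Ten duplicates per key on each side: the cross-product term dominates.
	fmt.Println(mergeJoinCost(1000, 1000, 10, 10)) // 12000
}
```

Under a model like this, choosing a more selective index lowers dupL/dupR and therefore the plan's cost, which would make the improved indexes found by the new planner actually win during costing.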

Correctness

Looking at the changed tests, there are three main ways that plan tests were affected:

Tests that previously generated MergeJoins but don't anymore

Since this PR doesn't touch costing, tests that no longer produce MergeJoins are a red flag, because it means that there may be some plans that the old implementation generates that the new one doesn't. These should be investigated before merging.

Tests that now generate MergeJoins but didn't before

Since this PR doesn't touch costing, tests that produce MergeJoins but didn't before imply that these new plans weren't being generated before. This means either the old planner was missing these plans, or these plans may not be correct and may be a bug in the new planner. These should be inspected to make sure the new plan is correct.

Tests that changed from one MergeJoin plan to another, equally useful MergeJoin plan

This was likely a consequence of the new algorithm generating the same set of plans but in a different order. Given that we currently cost all these plans equally, it's not surprising that small changes in the planner will cause churn here. These are fine.

@max-hoffman max-hoffman left a comment
I think the concept is sound. It will need a lot more tests, but you should feel free to push ahead to make this PR-ready. Correctness is the main concern for merge join; there are a lot of special cases: filter nullability, left joins, empty tables, etc. I'll need to walk through the logic more closely to understand whether there is a more efficient way to find optimal index matches. Ideally we'd take advantage of bitsets to make the code easy to read and also fast. We spend so much time (at database runtime, but also just while developing) doing filter/column comparisons that it's worth investing in conciseness here.

jb.Right = d.Child
}
for _, lIndex := range lIndexes {
matchedEqFilters := matchedFiltersForLeftIndex(lIndex, join.Left.RelProps.FuncDeps().Constants(), eqFilters)

functional dependencies are usually good at this -- the set of equality columns plus constants gives you an input ColSet. We've already built a ColSet for each index. We could use a combo of ColSet intersection and ordinal subtraction on the index ColSet to find all of the potential patching indexes (CRDB uses the phrase "InterestingOrderings" for the outcome of this process).
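The bitset approach suggested here can be sketched as follows. This is illustrative only (the real memo ColSet type may differ): a ColSet backed by a single word makes intersection and subset tests one machine operation each.

```go
package main

import (
	"fmt"
	"math/bits"
)

// ColSet is a bitset of column ids, in the spirit of the ColSet/bitset
// approach described above. (Sketch: a single uint64, so column ids < 64.)
type ColSet uint64

func (c ColSet) Add(col int) ColSet        { return c | (1 << col) }
func (c ColSet) Intersect(o ColSet) ColSet { return c & o }
func (c ColSet) SubsetOf(o ColSet) bool    { return c&^o == 0 }
func (c ColSet) Len() int                  { return bits.OnesCount64(uint64(c)) }

func main() {
	eqCols := ColSet(0).Add(1).Add(3)  // columns bound by equality filters
	constCols := ColSet(0).Add(5)      // columns the FDs prove constant
	input := eqCols | constCols        // the candidate input ColSet

	idxCols := ColSet(0).Add(1).Add(3).Add(5) // columns of some index

	// An index is a potential match when its columns are covered by the
	// equality-plus-constant set.
	fmt.Println(idxCols.SubsetOf(input))        // true
	fmt.Println(input.Intersect(idxCols).Len()) // 3
}
```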

This will likely eventually be superseded by using functional dependencies to recognize when a previous filter forces a subsequent filter to be constant.
@nicktobey nicktobey force-pushed the nicktobey/mergejoin branch from 167b59e to 194376f Compare August 24, 2023 00:06
…erated values.

The previous implementation made assumptions about the type used in the received IndexLookup, but this type can actually depend on exactly how the lookup was generated (and on whether the bounds value was parsed from the query or generated internally).

This makes no such assumptions and adds extra tests.
@nicktobey nicktobey force-pushed the nicktobey/mergejoin branch from 03ebdec to 65657ad Compare August 31, 2023 00:21
@nicktobey nicktobey marked this pull request as ready for review August 31, 2023 00:22
@nicktobey
Contributor Author

Okay, this is finally out of draft.

The only changes since @max-hoffman last looked at it are the new coster and the updated tests.

There's one new test suite, named "merge join large and small table".

There are a lot of changes to brittle tests, but having inspected them I believe they are all acceptable.

The modified tests fall into a few categories:

  • The new coster typically increases the cost for merge joins, because the previous cost was too low. This is especially apparent in cases where neither indexed lookup is provably unique (max one row per key), because in those cases we may have O(N*M) behavior, where N and M are the sizes of the child tables.
  • The new coster allows merges that weren't previously possible, so some tests use those new merges.
  • The old coster ignored details like the choice of index for the merge, instead just choosing whichever merge plan it happened to encounter first. The new merge planner may generate plans in a different order, changing which plan is selected. Or, because we now take index selection into account, the choice of index in these tests gets replaced with a better one.
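The O(N*M) behavior mentioned in the first bullet can be seen in a toy merge join over sorted keys. This sketch (int keys, not real rows) shows why non-unique keys force the iterator to cross-join duplicate runs:

```go
package main

import "fmt"

// mergeJoin joins two key-sorted slices on equality. For each left row it
// emits a pair with every right row in the matching duplicate run. When
// neither side is unique, a run of n left rows and m right rows sharing a
// key emits n*m pairs: the O(N*M) worst case. (Illustrative sketch.)
func mergeJoin(left, right []int) [][2]int {
	var out [][2]int
	j := 0
	for _, l := range left {
		// advance past right keys smaller than the current left key
		for j < len(right) && right[j] < l {
			j++
		}
		// cross-join with the run of right rows equal to the left key
		for k := j; k < len(right) && right[k] == l; k++ {
			out = append(out, [2]int{l, right[k]})
		}
	}
	return out
}

func main() {
	// Three duplicates on each side: 3*3 = 9 pairs for the single key.
	fmt.Println(len(mergeJoin([]int{1, 1, 1}, []int{1, 1, 1}))) // 9
	// Unique keys: output is bounded by the overlap.
	fmt.Println(len(mergeJoin([]int{1, 2, 3}, []int{2, 3, 4}))) // 2
}
```

Picking a more selective (ideally unique) index shrinks these duplicate runs, which is exactly why the new coster penalizes plans where neither lookup is provably unique.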

@max-hoffman max-hoffman left a comment

Mostly LGTM, there is a lot going on here. More correctness tests would be good. Double checking that this solves the customer's perf problem would be helpful. I'd anticipate mergeJoin issues/problems requiring some updating/compacting of this code over time. Improvements to the memo data structures probably will also let us simplify this over time.

@nicktobey nicktobey merged commit c470785 into main Sep 2, 2023
@nicktobey nicktobey deleted the nicktobey/mergejoin branch September 2, 2023 03:16
Successfully merging this pull request may close these issues.

Choose most selective index for Merge Join.