New Merge Join planner #1934


Merged
merged 22 commits into from
Sep 2, 2023
Conversation


@nicktobey nicktobey commented Aug 9, 2023

This should fix dolthub/dolt#6020 once finished, and then some.

The killer new feature in this new join planner is "Multi-Column Merge Joins": merges where the comparison incorporates multiple filter conditions. This allows us, in some cases, to choose a much more selective index for merge joins. This improves both memory usage and performance, because there will be fewer cases where the join iterator needs to keep multiple secondary rows in memory and cross-join them with multiple primary rows.
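Conceptually, the multi-column comparison reduces to a lexicographic comparison on key tuples. A minimal sketch (illustrative only; the real planner compares SQL values, not ints):

```go
package main

import "fmt"

// compareTuples lexicographically compares two key tuples, the way a
// multi-column merge join orders rows by a compound index key.
// Returns -1, 0, or 1. (Sketch; assumes equal-length tuples.)
func compareTuples(a, b []int) int {
	for i := range a {
		switch {
		case a[i] < b[i]:
			return -1
		case a[i] > b[i]:
			return 1
		}
	}
	return 0
}

func main() {
	fmt.Println(compareTuples([]int{1, 2}, []int{1, 3})) // -1
	fmt.Println(compareTuples([]int{2, 0}, []int{1, 9})) // 1
	fmt.Println(compareTuples([]int{1, 2}, []int{1, 2})) // 0
}
```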

The algorithm goes like this:

  • For each index on the left table:
    • Compute the max set of filter expressions that match that index
    • Check to see if any indexes on the right table match that same set of filters in the same order.
      • If so, use this set of filter expressions to generate a Merge Join plan. If there are multiple expressions, we combine them into a comparison on tuples.
    • Remove the last filter expression and check again, repeating until the "matched filters" list is empty.
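The matching step above can be sketched as follows. This is a simplified illustration, not the actual `matchedFiltersForLeftIndex` implementation: filters and indexes are reduced to column-name pairs, and one helper finds the longest index prefix covered by equality filters on both sides.

```go
package main

import "fmt"

// eqFilter is a simplified equality filter joining a left column to a
// right column. (Hypothetical representation for illustration.)
type eqFilter struct{ left, right string }

// matchedFilters returns the longest prefix of the two index column lists
// that is covered, position by position, by an equality filter. Shorter
// match lists correspond to the "remove the last filter and check again"
// step in the algorithm above.
func matchedFilters(leftIdx, rightIdx []string, filters []eqFilter) []eqFilter {
	var matched []eqFilter
	for i := 0; i < len(leftIdx) && i < len(rightIdx); i++ {
		found := false
		for _, f := range filters {
			if f.left == leftIdx[i] && f.right == rightIdx[i] {
				matched = append(matched, f)
				found = true
				break
			}
		}
		if !found {
			break // index prefix no longer covered; stop extending the match
		}
	}
	return matched
}

func main() {
	filters := []eqFilter{{"a", "x"}, {"b", "y"}}
	// Both index prefixes (a, b) and (x, y) are covered: two matched filters.
	fmt.Println(len(matchedFilters([]string{"a", "b", "c"}, []string{"x", "y"}, filters))) // 2
	// First positions don't line up with any filter: no match.
	fmt.Println(len(matchedFilters([]string{"b", "a"}, []string{"x", "y"}, filters))) // 0
}
```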

I added a test in join_planning_tests that demonstrates the potential of this new algorithm, allowing us to select a better index than would otherwise be allowed.

However, there is still some remaining work before this can be merged.

Costing

Currently the choice of index and comparison expression does not influence costing: all merge join plans on two given tables will have the same cost. This needs to be fixed to ensure we actually use these improved indexes once we find them.

The costing function for MergeJoins also seems suspiciously low: we may currently be favoring MergeJoins even when another join type might be preferable. Attempts to improve the cost function should take this into account.
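One hedged direction for the cost fix: make the estimate depend on the expected duplicates per key on each side, so that a plan over non-unique indexes costs more than one over unique indexes. The formula and names below are hypothetical, not the actual coster:

```go
package main

import "fmt"

// mergeJoinCost is a hypothetical cost model: a merge join scans both
// inputs once, plus work proportional to the cross products of
// duplicate-key runs. dupL and dupR are the average rows per key value
// on each side (1.0 when the index is provably unique).
func mergeJoinCost(rowsL, rowsR, dupL, dupR float64) float64 {
	scan := rowsL + rowsR
	// Approximate the number of key groups as rowsL/dupL; each group
	// cross-joins dupL left rows with dupR right rows.
	crossWork := (rowsL / dupL) * dupL * dupR
	return scan + crossWork
}

func main() {
	// Unique keys on both sides: cost stays near a single pass.
	fmt.Println(mergeJoinCost(1000, 1000, 1, 1)) // 3000
	// Ten duplicates per key on each side: the cross-product term dominates.
	fmt.Println(mergeJoinCost(1000, 1000, 10, 10)) // 12000
}
```

Under a model like this, choosing a more selective index lowers dupL/dupR and therefore the plan's cost, which would make the improved indexes found by the new planner actually win during costing.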

Correctness

Looking at the changed tests, there are three main ways that plan tests were affected:

Tests that previously generated MergeJoins but don't anymore

Since this PR doesn't touch costing, tests that no longer produce MergeJoins are a red flag, because it means that there may be some plans that the old implementation generates that the new one doesn't. These should be investigated before merging.

Tests that now generate MergeJoins but didn't before

Since this PR doesn't touch costing, tests that produce MergeJoins but didn't before imply that these new plans weren't being generated before. This means either the old planner was missing these plans, or these plans may not be correct and may be a bug in the new planner. These should be inspected to make sure the new plan is correct.

Tests that changed from one MergeJoin plan to another, equally useful MergeJoin plan

This was likely a consequence of the new algorithm generating the same set of plans but in a different order. Given that we currently cost all these plans equally, it's not surprising that small changes in the planner will cause churn here. These are fine.

@max-hoffman max-hoffman left a comment
I think the concept is sound. It will need a lot more tests, but you should feel free to push ahead to make this PR-ready. Correctness is the main concern for merge join; there are a lot of special cases: filter nullability, left joins, empty tables, etc. I'll need to walk through the logic more closely to understand whether there is a more efficient way to find optimal index matches. Ideally we'd take advantage of bitsets to make the code easy to read and also fast. We spend so much time (at database runtime, but also just while developing) doing filter/column comparisons that it's worth investing in conciseness here.

jb.Right = d.Child
}
for _, lIndex := range lIndexes {
matchedEqFilters := matchedFiltersForLeftIndex(lIndex, join.Left.RelProps.FuncDeps().Constants(), eqFilters)

functional dependencies are usually good at this -- the set of equality columns plus constants gives you an input ColSet. We've already built a ColSet for each index. We could use a combo of ColSet intersection and ordinal subtraction on the index ColSet to find all of the potential patching indexes (CRDB uses the phrase "InterestingOrderings" for the outcome of this process).
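The bitset approach suggested here can be sketched as follows. This is illustrative only (the real memo ColSet type may differ): a ColSet backed by a single word makes intersection and subset tests one machine operation each.

```go
package main

import (
	"fmt"
	"math/bits"
)

// ColSet is a bitset of column ids, in the spirit of the ColSet/bitset
// approach described above. (Sketch: a single uint64, so column ids < 64.)
type ColSet uint64

func (c ColSet) Add(col int) ColSet        { return c | (1 << col) }
func (c ColSet) Intersect(o ColSet) ColSet { return c & o }
func (c ColSet) SubsetOf(o ColSet) bool    { return c&^o == 0 }
func (c ColSet) Len() int                  { return bits.OnesCount64(uint64(c)) }

func main() {
	eqCols := ColSet(0).Add(1).Add(3)  // columns bound by equality filters
	constCols := ColSet(0).Add(5)      // columns the FDs prove constant
	input := eqCols | constCols        // the candidate input ColSet

	idxCols := ColSet(0).Add(1).Add(3).Add(5) // columns of some index

	// An index is a potential match when its columns are covered by the
	// equality-plus-constant set.
	fmt.Println(idxCols.SubsetOf(input))        // true
	fmt.Println(input.Intersect(idxCols).Len()) // 3
}
```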

This will likely eventually be superseded by using functional dependencies to recognize when a previous filter forces a subsequent filter to be constant.
@nicktobey nicktobey force-pushed the nicktobey/mergejoin branch from 167b59e to 194376f Compare August 24, 2023 00:06
…erated values.

The previous implementation made assumptions about the type used in the received IndexLookup, but this type can actually depend on exactly how the lookup was generated (and on whether the bounds value was parsed from the query or generated internally).

This makes no such assumptions and adds extra tests.
@nicktobey nicktobey force-pushed the nicktobey/mergejoin branch from 03ebdec to 65657ad Compare August 31, 2023 00:21
@nicktobey nicktobey marked this pull request as ready for review August 31, 2023 00:22
@nicktobey
Contributor Author

Okay, this is finally out of draft.

The only changes since @max-hoffman last looked at it are the new coster and the updated tests.

There's one new test suite, named "merge join large and small table".

There are a lot of changes to brittle tests, but having inspected them I believe they are all acceptable.

The modified tests fall into a few categories:

  • The new coster typically increases the cost for merge joins, because the previous cost was too low. This is especially apparent in cases where neither indexed lookup is provably unique (max one row per key), because in those cases we may have O(N*M) behavior, where N and M are the sizes of the child tables.
  • The new coster allows merges that weren't previously possible, so some tests use those new merges.
  • The old coster ignored details like the choice of index for the merge, instead just choosing whichever merge plan it happened to encounter first. The new merge planner may generate plans in a different order, changing which plan is selected. Or, because we now take index selection into account, the choice of index in these tests gets replaced with a better one.
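The O(N*M) behavior mentioned in the first bullet can be seen in a toy merge join over sorted keys. This sketch (int keys, not real rows) shows why non-unique keys force the iterator to cross-join duplicate runs:

```go
package main

import "fmt"

// mergeJoin joins two key-sorted slices on equality. For each left row it
// emits a pair with every right row in the matching duplicate run. When
// neither side is unique, a run of n left rows and m right rows sharing a
// key emits n*m pairs: the O(N*M) worst case. (Illustrative sketch.)
func mergeJoin(left, right []int) [][2]int {
	var out [][2]int
	j := 0
	for _, l := range left {
		// advance past right keys smaller than the current left key
		for j < len(right) && right[j] < l {
			j++
		}
		// cross-join with the run of right rows equal to the left key
		for k := j; k < len(right) && right[k] == l; k++ {
			out = append(out, [2]int{l, right[k]})
		}
	}
	return out
}

func main() {
	// Three duplicates on each side: 3*3 = 9 pairs for the single key.
	fmt.Println(len(mergeJoin([]int{1, 1, 1}, []int{1, 1, 1}))) // 9
	// Unique keys: output is bounded by the overlap.
	fmt.Println(len(mergeJoin([]int{1, 2, 3}, []int{2, 3, 4}))) // 2
}
```

Picking a more selective (ideally unique) index shrinks these duplicate runs, which is exactly why the new coster penalizes plans where neither lookup is provably unique.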

@max-hoffman max-hoffman left a comment

Mostly LGTM, there is a lot going on here. More correctness tests would be good. Double checking that this solves the customer's perf problem would be helpful. I'd anticipate mergeJoin issues/problems requiring some updating/compacting of this code over time. Improvements to the memo data structures probably will also let us simplify this over time.

@nicktobey nicktobey merged commit c470785 into main Sep 2, 2023
@nicktobey nicktobey deleted the nicktobey/mergejoin branch September 2, 2023 03:16
Successfully merging this pull request may close these issues.

Choose most selective index for Merge Join.