Skip to content

PRIMARY KEY isn't always used in left joins #9520

@kennethklee

Description

@kennethklee

TLDR; slow left join where primary key can be used as index, but isn't. benchmark size 30k rows, slow query takes 0.30 sec, optimal query takes 0.00 sec.

there's a case where the primary key isn't used in a left join, when it should be.

here's the query plan for two identical queries on two tables, test and test2. test uses primary key. test2 doesn't have a primary key, and uses an index on col, id.

Image

here's the reproduction:

# setup
SQL_CMD="dolt sql" # change to `mysql` to test difference in behavior
$SQL_CMD <<< "CREATE DATABASE test"

$SQL_CMD <<< "DROP TABLE IF EXISTS test"
$SQL_CMD <<< "CREATE TABLE test (id VARCHAR(255), parent VARCHAR(255), primary key (id))"

$SQL_CMD <<< "DROP TABLE IF EXISTS test2"
$SQL_CMD <<< "CREATE TABLE test2 (id VARCHAR(255), parent VARCHAR(255))"
$SQL_CMD <<< "CREATE UNIQUE INDEX idx_test2_id ON test2 (id)"

# data generation (not really needed for query plan, but here for performance benchmarks)
TABLE="test" # change this test2 to generate for 
# 10,000 grandparents
for j in {0..9}; do
  sql="INSERT INTO $TABLE (id, parent) VALUES ('test_$(( j * 1000 + 1 ))', NULL)"
  for ((i = j * 1000 + 2; i <= (j + 1) * 1000; i++)); do
    sql+=",('test_$i', NULL)"
  done
  $SQL_CMD <<< "$sql"
done

# 10,000 parents
for j in {0..9}; do
  sql="INSERT INTO $TABLE (id, parent) VALUES ('test_$(( j * 1000 + 10001 ))', 'test_$(( j * 1000 ))')"
  for ((i = j * 1000 + 10002; i <= (j + 1) * 1000 + 10000; i++)); do
    sql+=",('test_$i', 'test_$(( i - 10000 ))')"
  done
  $SQL_CMD <<< "$sql"
done

# 10,000 children
for j in {0..9}; do
  sql="INSERT INTO $TABLE (id, parent) VALUES ('test_$(( j * 1000 + 20001 ))', 'test_$(( j * 1000 + 10000 ))')"
  for ((i = j * 1000 + 20002; i <= (j + 1) * 1000 + 20000; i++)); do
    sql+=",('test_$i', 'test_$(( i - 10000 ))')"
  done
  $SQL_CMD <<< "$sql"
done

now for the query:

select *
from test child
  left join test parent on parent.id = child.parent
  left join test grandparent on grandparent.id = parent.parent
where child.id in ('test_0', 'test_5000', 'test_10000', 'test_15000', 'test_20000', 'test_25000')

Change the table from test to test2 for the other side.

performance different on my machine:

test/main*> select *
         -> from test child
         ->   left join test parent on parent.id = child.parent
         ->   left join test grandparent on grandparent.id = parent.parent
         -> where child.id in ('test_0', 'test_5000', 'test_10000', 'test_15000', 'test_20000', 'test_25000');
+------------+------------+------------+-----------+-----------+--------+
| id         | parent     | id         | parent    | id        | parent |
+------------+------------+------------+-----------+-----------+--------+
| test_10000 | NULL       | NULL       | NULL      | NULL      | NULL   |
| test_15000 | test_5000  | test_5000  | NULL      | NULL      | NULL   |
| test_20000 | test_10000 | test_10000 | NULL      | NULL      | NULL   |
| test_25000 | test_15000 | test_15000 | test_5000 | test_5000 | NULL   |
| test_5000  | NULL       | NULL       | NULL      | NULL      | NULL   |
+------------+------------+------------+-----------+-----------+--------+
5 rows in set (0.30 sec) 


test/main*> select *
         -> from test2 child
 on parent.id = child.parent
         ->   left join test2 parent on parent.id = child.parent
         ->   left join test2 grandparent on grandparent.id = parent.parent
         -> where child.id in ('test_0', 'test_5000', 'test_10000', 'test_15000', 'test_20000', 'test_25000');
+------------+------------+------------+-----------+-----------+--------+
| id         | parent     | id         | parent    | id        | parent |
+------------+------------+------------+-----------+-----------+--------+
| test_10000 | NULL       | NULL       | NULL      | NULL      | NULL   |
| test_15000 | test_5000  | test_5000  | NULL      | NULL      | NULL   |
| test_20000 | test_10000 | test_10000 | NULL      | NULL      | NULL   |
| test_25000 | test_15000 | test_15000 | test_5000 | test_5000 | NULL   |
| test_5000  | NULL       | NULL       | NULL      | NULL      | NULL   |
+------------+------------+------------+-----------+-----------+--------+
5 rows in set (0.00 sec) 

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions