Conversation

@kevin-dp kevin-dp commented Jun 23, 2025

This PR reworks the topK operator because the previous implementation was not incremental. It provides two implementations: an array-based one and a B+ tree one. The array implementation internally keeps a sorted array, so the position at which to insert or delete can be found efficiently with binary search, but the insertion or deletion itself still takes linear time. This is fine for small to medium collections. For big collections, we want to use the B+ tree implementation so that insertions and deletions take logarithmic time.
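To make the trade-off concrete, here is a hedged sketch of the array-based approach (all names are illustrative, not the PR's actual API): binary search locates the position in O(log n), but the splice that follows shifts elements and is O(n).

```typescript
class ArrayTopK<T> {
  private sorted: T[] = [];
  constructor(
    private comparator: (a: T, b: T) => number,
    private k: number
  ) {}

  // Binary search for the leftmost position where `value` fits (O(log n)).
  private findIndex(value: T): number {
    let lo = 0;
    let hi = this.sorted.length;
    while (lo < hi) {
      const mid = (lo + hi) >> 1;
      if (this.comparator(this.sorted[mid], value) < 0) lo = mid + 1;
      else hi = mid;
    }
    return lo;
  }

  insert(value: T): void {
    // Finding the position is logarithmic, but splice shifts elements: O(n).
    this.sorted.splice(this.findIndex(value), 0, value);
  }

  delete(value: T): void {
    const i = this.findIndex(value);
    if (i < this.sorted.length && this.comparator(this.sorted[i], value) === 0) {
      this.sorted.splice(i, 1); // also O(n) due to the shift
    }
  }

  topK(): T[] {
    return this.sorted.slice(0, this.k);
  }
}

const top3 = new ArrayTopK<number>((a, b) => a - b, 3);
[5, 1, 4, 2, 3].forEach((n) => top3.insert(n));
// top3.topK() now yields the three smallest elements: [1, 2, 3]
```

The B+ tree variant replaces the backing array with a tree, making both the search and the mutation logarithmic.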

TODO:

  • Benchmark these 2 implementations to confirm their theoretical time complexity holds in practice


@KyleAMathews KyleAMathews left a comment


great stuff! Looking forward to seeing the benchmarks

```diff
  "fractional-indexing": "^3.2.0",
- "murmurhash-js": "^1.0.0"
+ "murmurhash-js": "^1.0.0",
+ "sorted-btree": "^1.8.1"
```
Contributor

5.8kb gzipped: https://bundlephobia.com/package/sorted-btree@1.8.1

This is enough extra code weight (~24% increase to tanstack/db) that, depending on where the crossover point ends up being, this could be an opt-in thing, i.e. only use it if you have 50k+ items in a collection.

Contributor Author

Yes, that's the idea. We want to do some initial benchmarking to see where the crossover point is between using the array version and the tree version. We could automatically switch between them based on the size of the collection.

Contributor

Ok perfect, yeah that'd be easy with an async import 🚀

Contributor Author

@KyleAMathews the problem with using an async import is that it propagates to our API, but d2mini is sync and I don't think we want to make it async. Not sure how to get around this.

Contributor

@kevin-dp we can treat it like a JIT optimization perhaps? If the first sync run is too slow/big, we load sorted-btree in the background and start using it when it's loaded.

I agree we shouldn't make this async.

Contributor

Something like:

  • monitor the size of shapes
  • once a single shape reaches a certain size threshold we download/load the tree version of the operator
  • when starting a query, if one of the source collections is over a certain size and we have already loaded the tree version of the operator, we use it; if not, we don't
  • an additional optimisation would be to restart an existing query with the other operator once it has loaded, but this seems less needed
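A hedged sketch of this lazy-loading scheme (the function names and the threshold are illustrative, not tanstack/db's actual API): the first time a collection crosses the threshold, the tree implementation starts loading in the background; callers stay synchronous and only switch over once it has arrived.

```typescript
function makeOperatorPicker(
  threshold: number,
  loadTree: () => Promise<unknown>
): (collectionSize: number) => "array" | "tree" {
  let treeImpl: unknown | null = null;
  let loading: Promise<void> | null = null;

  return (collectionSize: number) => {
    // Kick off the background load the first time a collection crosses
    // the threshold; never block the caller on it.
    if (collectionSize >= threshold && loading === null) {
      loading = loadTree().then((impl) => {
        treeImpl = impl;
      });
    }
    // Use the tree variant only once it has actually finished loading;
    // until then, fall back to the array variant.
    return collectionSize >= threshold && treeImpl !== null ? "tree" : "array";
  };
}

// The real loader would be something like: () => import("sorted-btree").
// A stub keeps this sketch self-contained:
const pick = makeOperatorPicker(50_000, async () => "stub-tree-impl");
const small = pick(100);       // "array": under the threshold
const firstBig = pick(60_000); // "array": the load has only just started
```

Queries started after the load completes would then get `"tree"` for large collections.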

Contributor Author

@KyleAMathews yes, something we could do later if need be. For now, I introduced an async loadBTree function that must be called before using the tree variant of the operator. That way, we can keep the operator sync.
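A hedged sketch of this loadBTree approach (the real d2ts signatures differ; everything here is illustrative): the async loader is awaited once up front and caches the constructor, so the operator itself stays synchronous and fails fast if the tree library was never loaded.

```typescript
let BTreeCtor: (new () => unknown) | null = null;

// Awaited once before the tree variant is used; caches the constructor.
async function loadBTree(
  importer: () => Promise<{ default: new () => unknown }>
): Promise<void> {
  if (BTreeCtor === null) {
    BTreeCtor = (await importer()).default;
  }
}

// The operator stays fully synchronous once the constructor is cached.
function createBTreeTopK(): unknown {
  if (BTreeCtor === null) {
    throw new Error("loadBTree() must be awaited before using the tree variant");
  }
  return new BTreeCtor();
}

// Real usage would be roughly: await loadBTree(() => import("sorted-btree"));
```

This keeps the async boundary at initialization time instead of leaking it into the operator API.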

Contributor Author

> Something like:
>
> * monitor the size of shapes
> * once a single shape reaches a certain size threshold we download/load the tree version of the operator
> * when starting a query, if one of the source collections is over a certain size and we have already loaded the tree version of the operator, we use it; if not, we don't
> * an additional optimisation would be to restart an existing query with the other operator once it has loaded, but this seems less needed

Just-in-time data structures in the wild 😃

Contributor

@samwillis samwillis left a comment

Thanks @kevin-dp, all looks really good!

My one suggestion is that we split the BTree version into a separate operator, in a separate file.

So the array version is topKWithFractionalIndex and orderByWithFractionalIndex, and then we have a separate topKWithFractionalIndexBTree and orderByWithFractionalIndexBTree. That way, when the BTree isn't used it won't be bundled; at the moment, the condition that chooses which implementation to use will cause the BTree to be pulled in all the time. It should be possible to do this without duplication if you subclass TopKWithFractionalIndexOperator as TopKWithFractionalIndexBTreeOperator.
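The suggested split could look roughly like this (the class names follow the comment above, but all internals are assumed, and the BTree-backed index is stubbed with the array index so the sketch stays self-contained): the base operator obtains its index structure via a factory method, and the subclass overriding that method lives in a separate file, so bundlers can tree-shake `sorted-btree` whenever only the array variant is imported.

```typescript
interface SortedIndex<T> {
  insert(value: T): void;
  toArray(): T[];
}

// --- topKWithFractionalIndex.ts (array variant, always bundled) ---
class ArrayIndex<T> implements SortedIndex<T> {
  private items: T[] = [];
  constructor(private cmp: (a: T, b: T) => number) {}
  insert(value: T): void {
    this.items.push(value);
    this.items.sort(this.cmp); // simplified: real code inserts at a found position
  }
  toArray(): T[] {
    return [...this.items];
  }
}

class TopKWithFractionalIndexOperator<T> {
  constructor(protected cmp: (a: T, b: T) => number) {}
  protected createIndex(): SortedIndex<T> {
    return new ArrayIndex(this.cmp);
  }
}

// --- topKWithFractionalIndexBTree.ts (separate file, opt-in import) ---
// import BTree from "sorted-btree"; // only pulled in if this file is imported
class TopKWithFractionalIndexBTreeOperator<T> extends TopKWithFractionalIndexOperator<T> {
  protected override createIndex(): SortedIndex<T> {
    // Would wrap a BTree-backed index here; stubbed with the array index
    // so this sketch runs without the dependency.
    return new ArrayIndex(this.cmp);
  }
}

const op = new TopKWithFractionalIndexBTreeOperator<number>((a, b) => a - b);
```

Because the subclass only overrides the factory method, the rest of the operator logic is shared without duplication.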

@kevin-dp kevin-dp requested a review from samwillis June 24, 2025 08:45
Contributor

@samwillis samwillis left a comment

One note, and it needs a changeset, but other than that :shipit:

@kevin-dp
Contributor Author

> One note, and it needs a changeset, but other than that :shipit:

I'd like to benchmark it before we ship it, to make sure the two versions perform as expected.

@kevin-dp kevin-dp force-pushed the kevindp/incremental-topK-fractional branch from db524d7 to edad4be Compare July 7, 2025 14:43
@kevin-dp
Contributor Author

kevin-dp commented Jul 8, 2025

Here are the results of some microbenchmarks that exercise the topK operator in isolation from the rest of the pipeline. We notice no difference between the array and tree versions until the collection becomes big enough, at which point the array variant takes a big hit (probably because JS can no longer optimize the array, or perhaps it can't be stored contiguously in memory). From that point on, the tree variant is much faster.
[Benchmark screenshots: array vs B+ tree topK operator]

I've also performed some benchmarks of a d2mini pipeline containing the topK operator, but I couldn't reproduce the pattern above. Basically, the B+ tree version would also become slow. That indicates that some other part of the pipeline is the bottleneck. Still, the first benchmark shows that it is useful to have both versions of the topK operator. When we later resolve the bottleneck in the d2mini pipeline, we may reap the benefits of the tree variant of the topK operator.

@kevin-dp kevin-dp merged commit c5a59cf into main Jul 8, 2025
1 check passed
@kevin-dp kevin-dp deleted the kevindp/incremental-topK-fractional branch July 8, 2025 08:11
cursor bot pushed a commit to samwillis/d2ts that referenced this pull request Jul 13, 2025
* WIP incremental topKWithFractionalIndex

* Incremental topKWithFractionalIndex

* Fix tests to not assume particular fractional indices as those are implementation details.

* Introduce a TopK data structure

* B+ tree variant of topKWithFractionalIndex

* Extend unit tests to test all insertion and deletion cases

* Formatting

* Unit test for duplicate values

* Expose useTree option also on sortBy operator

* Split array and B+ tree variants in separate operators

* Add missing imports

* Add an orderBy operator that uses topK with the B+ tree variant

* Formatting

* Trigger CI

* Remove useTree option

* Changeset

* Dynamically import B+ tree library

* Also run orderByWithFractionalIndex tests for both the array and B+ tree variant

* Split tree version of orderBy into its own file to enable tree shaking the sorted-btree dependency.

* Improved changeset