Conversation

@kevin-dp kevin-dp commented Jun 23, 2025

This PR reworks the topK operator because the previous implementation was not incremental. It provides two implementations: an array-based one and a B+ tree one. The array implementation internally keeps a sorted array, so the position at which to insert or delete can be found efficiently with binary search, but the insertion or deletion itself still takes linear time. This is fine for small to medium collections. For big collections, we want to use the B+ tree implementation so that insertions and deletions take logarithmic time.
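To make the trade-off concrete, here is a hedged sketch of the array-based approach (all names are illustrative, not the PR's actual API): binary search locates the position in O(log n), but the splice that follows shifts elements and is O(n).

```typescript
class ArrayTopK<T> {
  private sorted: T[] = [];
  constructor(
    private comparator: (a: T, b: T) => number,
    private k: number
  ) {}

  // Binary search for the leftmost position where `value` fits (O(log n)).
  private findIndex(value: T): number {
    let lo = 0;
    let hi = this.sorted.length;
    while (lo < hi) {
      const mid = (lo + hi) >> 1;
      if (this.comparator(this.sorted[mid], value) < 0) lo = mid + 1;
      else hi = mid;
    }
    return lo;
  }

  insert(value: T): void {
    // Finding the position is logarithmic, but splice shifts elements: O(n).
    this.sorted.splice(this.findIndex(value), 0, value);
  }

  delete(value: T): void {
    const i = this.findIndex(value);
    if (i < this.sorted.length && this.comparator(this.sorted[i], value) === 0) {
      this.sorted.splice(i, 1); // also O(n) due to the shift
    }
  }

  topK(): T[] {
    return this.sorted.slice(0, this.k);
  }
}

const top3 = new ArrayTopK<number>((a, b) => a - b, 3);
[5, 1, 4, 2, 3].forEach((n) => top3.insert(n));
// top3.topK() now yields the three smallest elements: [1, 2, 3]
```

The B+ tree variant replaces the backing array with a tree, making both the search and the mutation logarithmic.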

TODO:

  • Benchmark these 2 implementations to confirm their theoretical time complexity holds in practice


@KyleAMathews KyleAMathews left a comment


great stuff! Looking forward to seeing the benchmarks

```diff
  "fractional-indexing": "^3.2.0",
- "murmurhash-js": "^1.0.0"
+ "murmurhash-js": "^1.0.0",
+ "sorted-btree": "^1.8.1"
```
Contributor

5.8kb gzipped: https://bundlephobia.com/package/sorted-btree@1.8.1

This is enough extra code weight (~24% increase to tanstack/db) that, depending on where the crossover point ends up being, this could be an opt-in thing, i.e. only use it if you have 50k+ items in a collection.

Contributor Author

Yes, that's the idea. We want to do some initial benchmarking to see where the crossover point is between using the array version and the tree version. We could automatically switch between them based on the size of the collection.

Contributor

Ok perfect, yeah that'd be easy with an async import 🚀

Contributor Author

@KyleAMathews the problem with using an async import is that it propagates to our API, but d2mini is sync and I don't think we want to make it async. Not sure how to get around this.

Contributor

@kevin-dp we can treat it like a JIT optimization perhaps? If the first sync run is too slow/big, we load sorted-btree in the background and start using it when it's loaded.

I agree we shouldn't make this async.

Contributor

Something like:

  • monitor the size of shapes
  • once a single shape reaches a certain size threshold we download/load the tree version of the operator
  • when starting a query, if one of the source collections is over a certain size and we have already loaded the tree version of the operator, we use it; if not, we don't
  • an additional optimisation would be to restart an existing query with the other operator once it has loaded, but this seems less needed
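A hedged sketch of this lazy-loading scheme (the function names and the threshold are illustrative, not tanstack/db's actual API): the first time a collection crosses the threshold, the tree implementation starts loading in the background; callers stay synchronous and only switch over once it has arrived.

```typescript
function makeOperatorPicker(
  threshold: number,
  loadTree: () => Promise<unknown>
): (collectionSize: number) => "array" | "tree" {
  let treeImpl: unknown | null = null;
  let loading: Promise<void> | null = null;

  return (collectionSize: number) => {
    // Kick off the background load the first time a collection crosses
    // the threshold; never block the caller on it.
    if (collectionSize >= threshold && loading === null) {
      loading = loadTree().then((impl) => {
        treeImpl = impl;
      });
    }
    // Use the tree variant only once it has actually finished loading;
    // until then, fall back to the array variant.
    return collectionSize >= threshold && treeImpl !== null ? "tree" : "array";
  };
}

// The real loader would be something like: () => import("sorted-btree").
// A stub keeps this sketch self-contained:
const pick = makeOperatorPicker(50_000, async () => "stub-tree-impl");
const small = pick(100);       // "array": under the threshold
const firstBig = pick(60_000); // "array": the load has only just started
```

Queries started after the load completes would then get `"tree"` for large collections.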

Contributor Author

@KyleAMathews yes, something we could do later if need be. For now, I introduced an async loadBTree function that must be called before using the tree variant of the operator. That way, we can keep the operator sync.
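A hedged sketch of this loadBTree approach (the real d2ts signatures differ; everything here is illustrative): the async loader is awaited once up front and caches the constructor, so the operator itself stays synchronous and fails fast if the tree library was never loaded.

```typescript
let BTreeCtor: (new () => unknown) | null = null;

// Awaited once before the tree variant is used; caches the constructor.
async function loadBTree(
  importer: () => Promise<{ default: new () => unknown }>
): Promise<void> {
  if (BTreeCtor === null) {
    BTreeCtor = (await importer()).default;
  }
}

// The operator stays fully synchronous once the constructor is cached.
function createBTreeTopK(): unknown {
  if (BTreeCtor === null) {
    throw new Error("loadBTree() must be awaited before using the tree variant");
  }
  return new BTreeCtor();
}

// Real usage would be roughly: await loadBTree(() => import("sorted-btree"));
```

This keeps the async boundary at initialization time instead of leaking it into the operator API.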

Contributor Author

> Something like:
>
> * monitor the size of shapes
> * once a single shape reaches a certain size threshold we download/load the tree version of the operator
> * when starting a query, if one of the source collections is over a certain size and we have already loaded the tree version of the operator, we use it; if not, we don't
> * an additional optimisation would be to restart an existing query with the other operator once it has loaded, but this seems less needed

Just-in-time data structures in the wild 😃

Contributor

@samwillis samwillis left a comment

Thanks @kevin-dp, all looks really good!

My one suggestion is that we split the BTree version into a separate operator, in a separate file.

So the array version is topKWithFractionalIndex and orderByWithFractionalIndex, and then we have a separate topKWithFractionalIndexBTree and orderByWithFractionalIndexBTree. That way, when the BTree isn't used it won't be bundled; at the moment, the condition that chooses which implementation to use will cause the BTree to be pulled in all the time. It should be possible to do this without duplication if you subclass TopKWithFractionalIndexOperator as TopKWithFractionalIndexBTreeOperator.
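The suggested split could look roughly like this (the class names follow the comment above, but all internals are assumed, and the BTree-backed index is stubbed with the array index so the sketch stays self-contained): the base operator obtains its index structure via a factory method, and the subclass overriding that method lives in a separate file, so bundlers can tree-shake `sorted-btree` whenever only the array variant is imported.

```typescript
interface SortedIndex<T> {
  insert(value: T): void;
  toArray(): T[];
}

// --- topKWithFractionalIndex.ts (array variant, always bundled) ---
class ArrayIndex<T> implements SortedIndex<T> {
  private items: T[] = [];
  constructor(private cmp: (a: T, b: T) => number) {}
  insert(value: T): void {
    this.items.push(value);
    this.items.sort(this.cmp); // simplified: real code inserts at a found position
  }
  toArray(): T[] {
    return [...this.items];
  }
}

class TopKWithFractionalIndexOperator<T> {
  constructor(protected cmp: (a: T, b: T) => number) {}
  protected createIndex(): SortedIndex<T> {
    return new ArrayIndex(this.cmp);
  }
}

// --- topKWithFractionalIndexBTree.ts (separate file, opt-in import) ---
// import BTree from "sorted-btree"; // only pulled in if this file is imported
class TopKWithFractionalIndexBTreeOperator<T> extends TopKWithFractionalIndexOperator<T> {
  protected override createIndex(): SortedIndex<T> {
    // Would wrap a BTree-backed index here; stubbed with the array index
    // so this sketch runs without the dependency.
    return new ArrayIndex(this.cmp);
  }
}

const op = new TopKWithFractionalIndexBTreeOperator<number>((a, b) => a - b);
```

Because the subclass only overrides the factory method, the rest of the operator logic is shared without duplication.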

@kevin-dp kevin-dp requested a review from samwillis June 24, 2025 08:45
Contributor

@samwillis samwillis left a comment

One note, and it needs a changeset, but other than that :shipit:

@kevin-dp
Contributor Author

> One note, and it needs a changeset, but other than that :shipit:

I'd like to benchmark it before we ship it, to make sure the two versions perform as expected.

@kevin-dp kevin-dp force-pushed the kevindp/incremental-topK-fractional branch from db524d7 to edad4be Compare July 7, 2025 14:43
@kevin-dp
Contributor Author

kevin-dp commented Jul 8, 2025

Here are the results of some microbenchmarks that exercise the topK operator in isolation from the rest of the pipeline. We notice no difference between the array and tree versions until the collection becomes big enough, at which point the array variant takes a big hit (probably because JS can no longer optimize the array, or perhaps it can't be stored contiguously in memory). From that point on, the tree variant is much faster.
[Benchmark screenshots: array vs B+ tree topK operator]

I've also performed some benchmarks of a d2mini pipeline containing the topK operator, but I couldn't reproduce the pattern above. Basically, the B+ tree version would also become slow. That indicates that some other part of the pipeline is the bottleneck. Still, the first benchmark shows that it is useful to have both versions of the topK operator. When we later resolve the bottleneck in the d2mini pipeline, we may reap the benefits of the tree variant of the topK operator.

@kevin-dp kevin-dp merged commit c5a59cf into main Jul 8, 2025
1 check passed
@kevin-dp kevin-dp deleted the kevindp/incremental-topK-fractional branch July 8, 2025 08:11
cursor bot pushed a commit to samwillis/d2ts that referenced this pull request Jul 13, 2025
* WIP incremental topKWithFractionalIndex

* Incremental topKWithFractionalIndex

* Fix tests to not assume particular fractional indices as those are implementation details.

* Introduce a TopK data structure

* B+ tree variant of topKWithFractionalIndex

* Extend unit tests to test all insertion and deletion cases

* Formatting

* Unit test for duplicate values

* Expose useTree option also on sortBy operator

* Split array and B+ tree variants in separate operators

* Add missing imports

* Add an orderBy operator that uses topK with the B+ tree variant

* Formatting

* Trigger CI

* Remove useTree option

* Changeset

* Dynamically import B+ tree library

* Also run orderByWithFractionalIndex tests for both the array and B+ tree variant

* Split tree version of orderBy into its own file to enable tree shaking the sorted-btree dependency.

* Improved changeset