-
Notifications
You must be signed in to change notification settings - Fork 8
Incremental topK with fractional index #72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
KyleAMathews
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great stuff! Looking forward to seeing the benchmarks
| "fractional-indexing": "^3.2.0", | ||
| "murmurhash-js": "^1.0.0" | ||
| "murmurhash-js": "^1.0.0", | ||
| "sorted-btree": "^1.8.1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
5.8kb gzipped https://bundlephobia.com/package/[email protected]
This is enough extra code weight (~24% increase to tanstack/db) that depending on where the crossover point ends up being, this could be an opt-in thing. I.e. only use if you have 50k+ items in a a collection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's the idea. We want to do some initial benchmarking to see when the turnover point is between using the array version or the tree version. We could automatically switch between them based on the size of the collection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok perfect, yeah that'd be easy with an async import 🚀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@KyleAMathews problem with using an async import is that it propagates to our API but d2mini is sync, i don't think we want to make it async. Not sure how to get around this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kevin-dp we can treat it like a JIT optimization perhaps? If the first sync run is too slow/big, we load sorted-btree in the background and start using it when it's loaded.
I agree we shouldn't make this async.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like:
- monitor the size of shapes
- once a single shape reaches a certain size threshold we download/load the tree version of the operator
- when starting a query, if one of the source collections is over a certain size, and we have already loaded the tree version of the operator we use that, if not then we don't.
- additional optimisation would be restart an existing query, with the other operator, once it has loaded, but this seems less needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@KyleAMathews yes something we could do later if need be. For now, i introduced an async loadBTree function that must be called before using the tree variant of the operator. That way, we can keep the operator sync.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like:
* monitor the size of shapes * once a single shape reaches a certain size threshold we download/load the tree version of the operator * when starting a query, if one of the source collections is over a certain size, and we have already loaded the tree version of the operator we use that, if not then we don't. * additional optimisation would be restart an existing query, with the other operator, once it has loaded, but this seems less needed
Just-in-time data structures in the wild 😃
samwillis
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @kevin-dp, all looks really good!
My one suggestion is we split the BTree version into a serrate operator, in a separate file.
So the array version is topKWithFractionIndex and orderByWithFractionIndex, and then we have a separate topKWithFractionIndexBTree and orderByWithFractionIndexBTree. That way when the Btree isn't used it won't be bundled - at the moment the condition on which implementation to use will cause the Btree to be pulled in all the time. It should be possible to do this without duplication is you subclass TopKWithFractionalIndexOperator as TopKWithFractionalIndexBtreeOperator.
samwillis
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One note, and it needs a changeset, but other than that ![]()
I'd like to benchmark it before we ship it, to make sure the two versions perform as expected. |
…plementation details.
db524d7 to
edad4be
Compare
…g the sorted-btree dependency.
* WIP incremental topKWithFractionalIndex * Incremental topKWithFractionalIndex * Fix tests to not assume particular fractional indices as those are implementation details. * Introduce a TopK data structure * B+ tree variant of topKWithFractionalIndex * Extend unit tests to test all insertion and deletion cases * Formatting * Unit test for duplicate values * Expose useTree option also on sortBy operator * Split array and B+ tree variants in separate operators * Add missing imports * Add a orderBy operator that uses topK with B+ tree variant * Formatting * Trigger CI * Remove useTree option * Changeset * Dynamically import B+ tree library * Also run orderByWithFractionalIndex tests for both the array and B+ tree variant * Split tree version of orderBy into its own file to enable tree shaking the sorted-btree dependency. * Improved changeset


This PR introduces changes to the topK operator because the previous implementation was not incremental. This PR provides 2 implementations: an array and a B+ tree implementation. The array implementation internally keeps a sorted array of elements to efficiently find the position where to insert/delete but the actual insertion/deletion is still in linear time. This is fine for small to medium collections. For big collections, we want to use the B+ tree implementation such that insertions and deletions are in logarithmic time.
TODO: