Feature: improve relationship builders for better async and reduced memory utilization #2077
ahgraber
commented
Jun 13, 2025
- CosineSimilarityBuilder now uses a sharded/chunked similarity calculation to significantly reduce memory requirements
- CosineSimilarityBuilder and JaccardSimilarityBuilder now leverage generate_execution_plan to support async iteration over tasks (for potential future multithreading or improved concurrency)
- Added unit tests
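For readers unfamiliar with the chunked approach, here is a minimal sketch of a blockwise cosine-similarity search. It is not the code from this PR: the function name `similar_pairs_blockwise`, its parameters, and the default `block_size` are assumptions made for illustration. The idea is that only one `block_size x block_size` tile of the similarity matrix exists at any time, rather than the full `n x n` matrix.

```python
import numpy as np


def similar_pairs_blockwise(embeddings, threshold, block_size=1024):
    """Return sorted (i, j, score) triples with i < j and cosine score >= threshold.

    Hypothetical helper for illustration; not the ragas implementation.
    """
    x = np.asarray(embeddings, dtype=np.float64)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero; zero vectors then score 0
    x = x / norms  # after normalization, a dot product is the cosine similarity

    n = x.shape[0]
    pairs = []
    for i0 in range(0, n, block_size):
        rows = x[i0:i0 + block_size]
        # Start at the diagonal tile: tiles below it mirror the tiles above it.
        for j0 in range(i0, n, block_size):
            tile = rows @ x[j0:j0 + block_size].T  # at most block_size x block_size
            ii, jj = np.nonzero(tile >= threshold)
            for a, b in zip(ii.tolist(), jj.tolist()):
                i, j = i0 + a, j0 + b
                if i < j:  # skip self-pairs and mirrored pairs in the diagonal tile
                    pairs.append((i, j, float(tile[a, b])))
    return sorted(pairs)
```

Peak memory for the similarity values scales with `block_size ** 2` instead of `n ** 2`, at the cost of more, smaller matrix products.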
fb608d0 to d30e58d
Thanks for the PR @ahgraber
@anistark - I've merged the changes from upstream
@ahgraber There are several breaking changes. Can we keep backwards compatibility for _find_similar_embedding_pairs:
- The extra param changes the method signature. Try a default value for the extra param.
- Sync-to-async conversion. It is probably better to provide both sync and async versions.
- The return type changed from List to Set. Is there a clear benefit, or can we use Set internally and keep List for backwards compatibility?
Also, we got rid of experimental. Please rebase once more and align with the latest changes.
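The three requests above could be satisfied along the following lines. This is a sketch under stated assumptions, not the code that was merged: the class `BuilderSketch`, the async wrapper name `_afind_similar_embedding_pairs`, and the pure-Python scoring loop are all hypothetical. It assumes Python 3.9+ for `asyncio.to_thread`.

```python
import asyncio
import math
from typing import List, Optional, Sequence, Set, Tuple

Pair = Tuple[int, int, float]


class BuilderSketch:
    """Hypothetical stand-in for a similarity builder; not the ragas class."""

    def __init__(self, block_size: int = 1024) -> None:
        self.block_size = block_size

    def _find_similar_embedding_pairs(
        self,
        embeddings: Sequence[Sequence[float]],
        threshold: float,
        block_size: Optional[int] = None,  # new param with a default: old two-argument calls still work
    ) -> List[Pair]:
        size = block_size or self.block_size
        norms = [math.sqrt(sum(v * v for v in e)) or 1.0 for e in embeddings]
        found: Set[Pair] = set()  # a set internally, so duplicates cannot accumulate
        n = len(embeddings)
        # In this toy version block_size only groups the rows; results do not depend on it.
        for start in range(0, n, size):
            for i in range(start, min(start + size, n)):
                for j in range(i + 1, n):
                    dot = sum(a * b for a, b in zip(embeddings[i], embeddings[j]))
                    score = dot / (norms[i] * norms[j])
                    if score >= threshold:
                        found.add((i, j, score))
        return sorted(found)  # callers still receive a List, in a stable order

    async def _afind_similar_embedding_pairs(
        self,
        embeddings: Sequence[Sequence[float]],
        threshold: float,
        block_size: Optional[int] = None,
    ) -> List[Pair]:
        # Async variant delegates to the sync one in a worker thread.
        return await asyncio.to_thread(
            self._find_similar_embedding_pairs, embeddings, threshold, block_size
        )
```

Keeping the sync method as the single source of truth means the two variants cannot drift apart, and existing callers see no change in signature or return type.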
…methods
- Refactored the JaccardSimilarityBuilder to use async methods for finding similar embedding pairs.
- Introduced a new method `generate_execution_plan` to generate coroutines of comparisons for better tracking and potential concurrency.
- Updated the `transform` method to utilize the new async functionality.
- Added comprehensive test coverage for the new features in the JaccardSimilarityBuilder.
- Improved logic for generating similar and dissimilar sets based on input constraints.
- keep _find_similar_embedding_pairs as sync
- _find_similar_embedding_pairs should return List
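As a rough illustration of what "an execution plan of coroutines" can look like for Jaccard comparisons, consider the sketch below. The function names `compare_pair`, `plan_comparisons`, and `run_plan` are hypothetical and do not reflect the signature of `generate_execution_plan` in this PR; the inputs are plain sets of strings rather than knowledge-graph nodes.

```python
import asyncio
from itertools import combinations
from typing import Coroutine, List, Optional, Set, Tuple

Match = Tuple[int, int, float]


async def compare_pair(
    i: int, j: int, a: Set[str], b: Set[str], threshold: float
) -> Optional[Match]:
    """Jaccard similarity of two sets; returns None when below the threshold."""
    union = a | b
    score = len(a & b) / len(union) if union else 0.0
    return (i, j, score) if score >= threshold else None


def plan_comparisons(items: List[Set[str]], threshold: float) -> List[Coroutine]:
    """Build one coroutine per unordered pair. Nothing runs until they are awaited."""
    return [
        compare_pair(i, j, items[i], items[j], threshold)
        for i, j in combinations(range(len(items)), 2)
    ]


async def run_plan(plan: List[Coroutine]) -> List[Match]:
    # gather preserves the order of the plan, so results are deterministic.
    results = await asyncio.gather(*plan)
    return [r for r in results if r is not None]
```

Separating plan construction from execution lets a caller count the pending comparisons, attach progress tracking, or swap in a different scheduler without touching the comparison logic.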
301d9c7 to 0917aea
- remove block_size parameter from _find_similar_embedding_pairs and use CosineSimilarityBuilder attribute
@anistark I've cleanly rebased. Re: requested changes:
LGTM
/claude-review
Claude encountered an error
I'll analyze this and get back to you.