Skip to content

Conversation

ahgraber
Copy link
Contributor

  • CosineSimilarityBuilder now uses a sharded/chunked similarity calculation to significantly reduce memory requirements
  • CosineSimilarityBuilder and JaccardSimilarityBuilder now leverage generate_execution_plan to support async iteration over tasks (for potential future multithreading or improved concurrency)
  • Added unit tests

@dosubot dosubot bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Jun 13, 2025
@ahgraber ahgraber force-pushed the feature/relationship-builder branch 2 times, most recently from fb608d0 to d30e58d Compare July 18, 2025 17:21
@anistark
Copy link
Contributor

Thanks for the PR @ahgraber
Could you please rebase and check for the merge conflicts?

@ahgraber
Copy link
Contributor Author

@anistark - I've merged the changes from upstream

@anistark
Copy link
Contributor

@anistark - I've merged the changes from upstream

Thanks @ahgraber
However, I see the type check ci failed. Could you please run make type locally to verify the issue.

@ahgraber
Copy link
Contributor Author

It works on my computer (TM).
Screenshot 2025-08-26 at 9 40 54 AM

Copy link
Contributor

@anistark anistark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ahgraber There's several breaking changes. Can we have backwards compatibility for _find_similar_embedding_pairs:

  • extra param changing the method signature. Try default value for the extra param.
  • sync to async conversion. Probably better to provide both sync and async versions.
  • return type change from List -> Set. Is there a pure benefit or we can use Set internally and keep List for backwards compatibility.

Also, we got rid of experimental. Please rebase once more and align with latest changes.

…methods

- Refactored the JaccardSimilarityBuilder to use async methods for finding similar embedding pairs.
- Introduced a new method `generate_execution_plan` to generate coroutines of comparisons for better tracking and potential concurrency
- Updated the `transform` method to utilize the new async functionality.
- Added comprehensive test coverage for the new features in the JaccardSimilarityBuilder.
- Improved logic for generating similar and dissimilar sets based on input constraints.
- keep _find_similar_embedding_pairs as sync
- _find_similar_embedding_pairs should return List
@ahgraber ahgraber force-pushed the feature/relationship-builder branch from 301d9c7 to 0917aea Compare September 1, 2025 14:00
@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:XXL This PR changes 1000+ lines, ignoring generated files. labels Sep 1, 2025
- remove block_size parameter from _find_similar_embedding_pairs and use CosineSimilarityBuilder attribute
@ahgraber
Copy link
Contributor Author

ahgraber commented Sep 1, 2025

@anistark I've cleanly rebased.

Re: requested changes:

  • extra param changing the method signature. Try default value for the extra param. -> now uses the CosineSimilarityBuilder attribute (which is new, but has a default) instead of altering the method signature.
  • sync to async conversion. Probably better to provide both sync and async versions. -> have reverted to sync
  • return type change from List -> Set. Is there a pure benefit or we can use Set internally and keep List for backwards compatibility. -> have reverted to List. Set is more efficient (set is used internally for fast deduplication), but list conversion post-hoc is trivial relative to the cosine similarity calculations.

@ahgraber ahgraber requested a review from anistark September 1, 2025 14:18
Copy link
Contributor

@anistark anistark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@anistark
Copy link
Contributor

anistark commented Sep 1, 2025

/claude-review

Copy link

claude bot commented Sep 1, 2025

Claude encountered an error —— View job

Failed with exit code 128

I'll analyze this and get back to you.

@anistark anistark merged commit e821105 into explodinggradients:main Sep 1, 2025
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:XL This PR changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants