@alexeykudinkin alexeykudinkin commented May 1, 2025

Why are these changes needed?

This PR adds the Dataset.join operator, which allows two datasets to be joined using the previously added hash-shuffle implementation.

Currently, the following join types are supported:

  • Inner (left/right)
  • Outer (left/right)

In the future we'll be adding support for more join types.
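
For readers unfamiliar with hash-shuffle joins, the idea can be sketched in plain Python: rows from both sides are routed to partitions by the hash of the join key, so matching keys always land in the same partition, and each partition pair is then joined independently. This is only an illustrative sketch under stated assumptions, not the JoinOperator added in this PR: the function names (`hash_partition`, `join_partition`, `hash_join`), the join-type strings, and the `num_partitions` default are all hypothetical, and everything runs in a single process.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Set

Row = Dict[str, Any]

# Hypothetical join-type names used only by this sketch.
JOIN_TYPES = ("inner", "left_outer", "right_outer")


def hash_partition(rows: Iterable[Row], key: str, num_partitions: int) -> List[List[Row]]:
    """Route each row to the partition selected by the hash of its join key."""
    partitions: List[List[Row]] = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions


def join_partition(left: List[Row], right: List[Row], key: str, join_type: str,
                   left_cols: Set[str], right_cols: Set[str]) -> List[Row]:
    """Join one pair of co-located partitions using an in-memory hash table."""
    index: Dict[Any, List[Row]] = defaultdict(list)
    for r in right:
        index[r[key]].append(r)

    matched_keys = set()
    out: List[Row] = []
    for l in left:
        matches = index.get(l[key])
        if matches:
            matched_keys.add(l[key])
            # On column-name collisions the left value wins in this sketch.
            out.extend({**r, **l} for r in matches)
        elif join_type == "left_outer":
            # Unmatched left row: fill the right-side columns with None.
            out.append({**dict.fromkeys(right_cols), **l})
    if join_type == "right_outer":
        for k, rows in index.items():
            if k not in matched_keys:
                # Unmatched right rows: fill the left-side columns with None.
                out.extend({**dict.fromkeys(left_cols), **r} for r in rows)
    return out


def hash_join(left: Iterable[Row], right: Iterable[Row], key: str,
              join_type: str = "inner", num_partitions: int = 4) -> List[Row]:
    """Hash-partition both inputs on `key`, then join partition by partition."""
    if join_type not in JOIN_TYPES:
        raise ValueError(f"unsupported join type: {join_type!r}")
    left, right = list(left), list(right)
    left_cols = {c for row in left for c in row}
    right_cols = {c for row in right for c in row}
    left_parts = hash_partition(left, key, num_partitions)
    right_parts = hash_partition(right, key, num_partitions)
    out: List[Row] = []
    # Equal keys hash to the same index, so only same-index partitions are paired.
    for lp, rp in zip(left_parts, right_parts):
        out.extend(join_partition(lp, rp, key, join_type, left_cols, right_cols))
    return out


users = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
orders = [{"id": 2, "total": 10}, {"id": 3, "total": 7}]
inner_rows = hash_join(users, orders, key="id")
left_rows = hash_join(users, orders, key="id", join_type="left_outer")
```

Because each partition pair is independent, a distributed engine can hand the pairs to separate workers; the output row order depends on the partitioning and should not be relied on.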

Changes

  • Added JoinOperator
  • Added Dataset.join
  • Added tests

Related issue number

Closes #18911 (Finally!!!)

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Alexey Kudinkin <[email protected]>
@alexeykudinkin alexeykudinkin requested a review from a team as a code owner May 1, 2025 22:23
@alexeykudinkin alexeykudinkin requested a review from richardliaw May 1, 2025 22:24
@alexeykudinkin alexeykudinkin added the "go" label (add ONLY when ready to merge, run all tests) May 1, 2025
TorchBatchType = Union[Dict[str, "torch.Tensor"], CollatedData]

BT_API_GROUP = "Basic Transformations"
J_API_GROUP = "Joining Operation APIs"

@richardliaw richardliaw May 2, 2025

should this be part of SMD_API_GROUP, which would include zip/merge?

Contributor Author

Yeah, I can add it to that group, but I don't really understand the intention behind this grouping, as it seems practically arbitrary in many cases.

@richardliaw richardliaw May 2, 2025

I believe the group names get rendered in the documentation for better user discovery (grouping APIs into subsections).

@richardliaw richardliaw left a comment

nice!!

@richardliaw richardliaw merged commit c6c338c into ray-project:master May 3, 2025
5 checks passed
@richardliaw richardliaw mentioned this pull request May 23, 2025
10 tasks
Labels
community-backlog, go (add ONLY when ready to merge, run all tests)
Development

Successfully merging this pull request may close these issues.

[Datasets] [Feature] Support joins/merges
3 participants