@goutamvenkat-anyscale goutamvenkat-anyscale commented Nov 3, 2025

Description

Currently, drop_columns is implemented with map_batches rather than an expression, which prevents pushdown optimizations such as projection pushdown.

The following changes were made:

  1. Added DropExpr to Expressions
  2. Switched drop_columns to the projection [star(), drop(cols)]
  3. Updated Projection Pushdown to handle DropExpr
  4. Set the Mongo datasource's client to None, because operator fusion requires the datasource to be serializable
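To make the idea behind items 1 to 3 concrete, here is a minimal, dependency-free sketch of how a [star(), drop(cols)] projection can be resolved into an explicit column list that a reader could then use to skip unwanted columns. The names Star, Drop, resolve_projection, and read_table are hypothetical stand-ins for illustration only; they are not Ray Data's classes or the code in this PR.

```python
from dataclasses import dataclass
from typing import List, Sequence, Union


@dataclass(frozen=True)
class Star:
    """Hypothetical stand-in for star(): keep every input column."""


@dataclass(frozen=True)
class Drop:
    """Hypothetical stand-in for a drop expression: remove the named columns."""
    names: tuple


Expr = Union[Star, Drop]


def resolve_projection(input_columns: Sequence[str], exprs: Sequence[Expr]) -> List[str]:
    """Resolve a projection such as [Star(), Drop(...)] into explicit output columns."""
    selected: List[str] = []
    dropped = set()
    for expr in exprs:
        if isinstance(expr, Star):
            # Star expands to all input columns, preserving their order.
            selected.extend(c for c in input_columns if c not in selected)
        elif isinstance(expr, Drop):
            dropped.update(expr.names)
    return [c for c in selected if c not in dropped]


def read_table(table: dict, columns: Sequence[str]) -> dict:
    """Toy 'datasource' that only materializes the requested columns."""
    return {name: table[name] for name in columns}


table = {"id": [1, 2], "name": ["a", "b"], "payload": ["x" * 8, "y" * 8]}

# Because the drop is declarative, the surviving columns are known before reading.
kept = resolve_projection(list(table), [Star(), Drop(("payload",))])
pushed_down = read_table(table, kept)
```

An opaque function passed to map_batches gives an optimizer nothing to inspect, whereas a declarative projection like the one above can be resolved ahead of time and handed to the read step.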

Before: drop_columns used map_batches → no fusion with Read → different operator structure
After: drop_columns uses expressions → CAN fuse with Read → the datasource needs to be serializable earlier/more frequently
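The serialization point can be illustrated with a small sketch, assuming a datasource that lazily creates a client and that the framework needs to pickle the datasource's state. ToyDatasource, release_client, and state_is_picklable are hypothetical names, a threading.Lock stands in for a live database connection, and the standard pickle module stands in for whatever serializer is actually used.

```python
import pickle
import threading


class ToyDatasource:
    """Hypothetical datasource holding a lazily created, unpicklable client."""

    def __init__(self, uri: str):
        self._uri = uri
        self._client = None  # created on first use

    def _get_client(self):
        if self._client is None:
            # A lock stands in for a live connection: it cannot be pickled.
            self._client = threading.Lock()
        return self._client

    def read(self) -> str:
        self._get_client()
        return f"rows from {self._uri}"

    def release_client(self) -> None:
        # Resetting the client to None leaves only plain, picklable state.
        self._client = None


def state_is_picklable(obj) -> bool:
    """Return True if the instance's attributes survive pickle.dumps."""
    try:
        pickle.dumps(vars(obj))
        return True
    except (TypeError, pickle.PicklingError):
        return False


ds = ToyDatasource("mongodb://example")
before_use = state_is_picklable(ds)       # no client yet
ds.read()
while_connected = state_is_picklable(ds)  # live client attached
ds.release_client()
after_release = state_is_picklable(ds)    # client reset to None
```

In this sketch the datasource stops being picklable only while a client is attached, so resetting the client before the object is shipped avoids the failure; the client is simply recreated on the next read.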


@goutamvenkat-anyscale goutamvenkat-anyscale requested a review from a team as a code owner November 3, 2025 23:09
Signed-off-by: Goutam <[email protected]>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request is a great improvement. It refactors drop_columns to use an expression-based approach by introducing DropExpr. This change enables drop_columns to participate in logical plan optimizations like projection pushdown, which should improve performance. The implementation is thorough, touching all layers of the expression system from definition to evaluation and optimization. The addition of comprehensive tests, especially the parametrized tests covering various fusion scenarios, gives high confidence in the correctness of this complex change.

cursor[bot]

This comment was marked as outdated.

if e.name not in downstream_input_column_rename_map:
    projected_upstream_output_col_exprs.append(e)
# Get DropExpr from downstream (before translation)
downstream_drop_exprs_pre_translation = [
Contributor Author

pre_renaming

@goutamvenkat-anyscale goutamvenkat-anyscale added the data (Ray Data-related issues) and go (add ONLY when ready to merge, run all tests) labels Nov 3, 2025
@goutamvenkat-anyscale
Contributor Author

/gemini review

cursor[bot]

This comment was marked as outdated.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors drop_columns to use an expression-based approach instead of map_batches. This is a significant architectural improvement that enables projection pushdown optimizations for drop_columns. The changes include introducing a new DropExpr, updating the projection fusion and pushdown logic to correctly handle this new expression, and adding comprehensive tests to ensure correctness across various scenarios. The implementation is solid and the new tests are thorough. A potential serialization issue in the Mongo datasource is also fixed. Overall, this is a high-quality contribution that improves the performance and consistency of the Data API.

@goutamvenkat-anyscale goutamvenkat-anyscale requested a review from a team as a code owner November 4, 2025 15:32
Collaborator

@aslonnie aslonnie left a comment

does not seem to require CI review?

@github-actions

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale label (The issue is stale. It will be closed within 7 days unless there are further conversation) Nov 20, 2025
@goutamvenkat-anyscale goutamvenkat-anyscale removed the stale label Nov 21, 2025

Labels

data (Ray Data-related issues), go (add ONLY when ready to merge, run all tests)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ray fails to serialize self-reference objects

2 participants