[Data] - Add support for drop expr() for drop_columns #58387

goutamvenkat-anyscale · 2025-11-03T23:09:45Z

Description

Currently, drop_columns is implemented with map_batches rather than an expression, which prevents any pushdown optimizations.

The following changes were made:

Added DropExpr to Expressions
Used [star(), drop(cols)] in drop_columns
Updated Projection Pushdown to handle DropExpr
Set the Mongo datasource's client to None because of operator fusion
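
To illustrate the [star(), drop(cols)] idea, here is a minimal, self-contained sketch. The Star and Drop classes and the project helper are hypothetical stand-ins, not Ray Data's actual DropExpr implementation. They only show how "keep everything, then remove these columns" can be expressed as plain data that an optimizer could inspect, unlike an opaque function passed to map_batches.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass(frozen=True)
class Star:
    """Hypothetical marker: keep every input column."""


@dataclass(frozen=True)
class Drop:
    """Hypothetical marker: remove the named columns from the output."""

    columns: Tuple[str, ...]


Expr = Union[Star, Drop]


def project(batch: Dict[str, list], exprs: List[Expr]) -> Dict[str, list]:
    """Apply a projection list, in order, to a column-oriented batch."""
    out: Dict[str, list] = {}
    for expr in exprs:
        if isinstance(expr, Star):
            # Star copies every input column into the output.
            out.update(batch)
        elif isinstance(expr, Drop):
            for name in expr.columns:
                if name not in out:
                    raise KeyError(f"cannot drop missing column: {name}")
                del out[name]
    return out


batch = {"id": [1, 2], "name": ["a", "b"], "score": [0.5, 0.9]}
result = project(batch, [Star(), Drop(("score",))])
print(result)  # {'id': [1, 2], 'name': ['a', 'b']}
```

Because the dropped column names are visible in the projection list, a planner can reason about them without running any user code.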

Before: drop_columns used map_batches → no fusion with Read → different operator structure
After: drop_columns uses expressions → can fuse with Read → the datasource must be serializable earlier and more often
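
The serialization constraint can be sketched as follows. This is an assumption-laden illustration, not the Mongo datasource's actual code: FakeClient and LazyClientDatasource are hypothetical names, and the pattern shown (create the client lazily, exclude it from the pickled state) is one common way to keep an object serializable even after a connection has been opened.

```python
import pickle


class FakeClient:
    """Hypothetical stand-in for a database client that cannot be pickled."""

    def __init__(self, uri: str):
        self.uri = uri

    def __reduce__(self):
        raise TypeError("FakeClient cannot be pickled")


class LazyClientDatasource:
    """Hypothetical datasource that creates its client only on first use."""

    def __init__(self, uri: str):
        self._uri = uri
        self._client = None  # not created until it is needed

    def _get_client(self) -> FakeClient:
        if self._client is None:
            self._client = FakeClient(self._uri)
        return self._client

    def __getstate__(self):
        # pickle calls this hook when serializing the object; the live
        # client is replaced with None so the state stays picklable.
        state = self.__dict__.copy()
        state["_client"] = None
        return state


source = LazyClientDatasource("mongodb://localhost:27017")
source._get_client()              # a live client now exists locally
state = source.__getstate__()     # the state that pickle would serialize
payload = pickle.dumps(state)     # succeeds: no live client in the state
```

With this shape, it does not matter whether serialization happens before or after the client is first used, which is the situation that fusion with Read makes more likely.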

Signed-off-by: Goutam <[email protected]>

python/ray/data/_internal/logical/rules/projection_pushdown.py

gemini-code-assist

Code Review

This pull request is a great improvement. It refactors drop_columns to use an expression-based approach by introducing DropExpr. This change enables drop_columns to participate in logical plan optimizations like projection pushdown, which should improve performance. The implementation is thorough, touching all layers of the expression system from definition to evaluation and optimization. The addition of comprehensive tests, especially the parametrized tests covering various fusion scenarios, gives high confidence in the correctness of this complex change.

python/ray/data/tests/test_map.py

python/ray/data/tests/test_expressions.py

goutamvenkat-anyscale · 2025-11-03T23:15:35Z

python/ray/data/_internal/logical/rules/projection_pushdown.py

-            if e.name not in downstream_input_column_rename_map:
-                projected_upstream_output_col_exprs.append(e)
+        # Get DropExpr from downstream (before translation)
+        downstream_drop_exprs_pre_translation = [


pre_renaming

Signed-off-by: Goutam <[email protected]>

goutamvenkat-anyscale · 2025-11-04T02:01:55Z

/gemini review

gemini-code-assist

Code Review

This pull request refactors drop_columns to use an expression-based approach instead of map_batches. This is a significant architectural improvement that enables projection pushdown optimizations for drop_columns. The changes include introducing a new DropExpr, updating the projection fusion and pushdown logic to correctly handle this new expression, and adding comprehensive tests to ensure correctness across various scenarios. The implementation is solid and the new tests are thorough. A potential serialization issue in the Mongo datasource is also fixed. Overall, this is a high-quality contribution that improves the performance and consistency of the Data API.
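
As a rough illustration of what pushing a drop into a read means, the sketch below computes which columns a read still needs once a downstream drop is known. The columns_to_read function is a hypothetical helper, not part of Ray Data's projection pushdown rule, and it ignores renames and computed columns.

```python
from typing import List, Sequence


def columns_to_read(available: Sequence[str], dropped: Sequence[str]) -> List[str]:
    """Return the columns a read needs after a downstream drop is pushed into it."""
    dropped_set = set(dropped)
    unknown = dropped_set - set(available)
    if unknown:
        raise KeyError(f"unknown columns: {sorted(unknown)}")
    # Preserve the original column order while skipping dropped columns.
    return [name for name in available if name not in dropped_set]


pruned = columns_to_read(["id", "name", "score"], ["score"])
print(pruned)  # ['id', 'name']
```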

Signed-off-by: Goutam <[email protected]>

aslonnie

does not seem to require CI review?

github-actions · 2025-11-20T12:25:41Z

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public Slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

[Data] - Add support for drop expr() for drop_columns

257fa84

Signed-off-by: Goutam <[email protected]>

goutamvenkat-anyscale requested a review from a team as a code owner November 3, 2025 23:09

Add rst

6e19b03

Signed-off-by: Goutam <[email protected]>

goutamvenkat-anyscale commented Nov 3, 2025

python/ray/data/_internal/logical/rules/projection_pushdown.py (outdated)

gemini-code-assist bot reviewed Nov 3, 2025

python/ray/data/tests/test_map.py (outdated)

goutamvenkat-anyscale commented Nov 3, 2025

python/ray/data/tests/test_expressions.py (outdated)

goutamvenkat-anyscale commented Nov 3, 2025

goutamvenkat-anyscale added 2 commits November 3, 2025 15:17

Add test back

3855084

Signed-off-by: Goutam <[email protected]>

Restore test

05f1086

Signed-off-by: Goutam <[email protected]>

goutamvenkat-anyscale added the "data" (Ray Data-related issues) and "go" (add ONLY when ready to merge, run all tests) labels Nov 3, 2025

goutamvenkat-anyscale added 2 commits November 3, 2025 16:15

Merge branch 'master' into goutam/drop_expr

a22072c

Fix tests

7ce3903

Signed-off-by: Goutam <[email protected]>

gemini-code-assist bot reviewed Nov 4, 2025

goutamvenkat-anyscale added 4 commits November 3, 2025 18:16

Add test for dropping computed cols

f822bfb

Signed-off-by: Goutam <[email protected]>

repr test

22970f2

Signed-off-by: Goutam <[email protected]>

clean up mongo

a28e89c

Signed-off-by: Goutam <[email protected]>

Update pymongoarrow

7468667

Signed-off-by: Goutam <[email protected]>

goutamvenkat-anyscale requested a review from a team as a code owner November 4, 2025 15:32

Go back to older mongo version

66f5401

Signed-off-by: Goutam <[email protected]>

aslonnie reviewed Nov 6, 2025

richardliaw mentioned this pull request Nov 15, 2025

Ray Data Q4 Roadmap + Wishlist #58665

Open

github-actions bot added the "stale" label (The issue is stale. It will be closed within 7 days unless there are further conversation) Nov 20, 2025

goutamvenkat-anyscale removed the "stale" label Nov 21, 2025
