-
Notifications
You must be signed in to change notification settings - Fork 6.9k
[Data] - Add support for drop expr() for drop_columns #58387
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
[Data] - Add support for drop expr() for drop_columns #58387
Conversation
Signed-off-by: Goutam <[email protected]>
Signed-off-by: Goutam <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request is a great improvement. It refactors drop_columns to use an expression-based approach by introducing DropExpr. This change enables drop_columns to participate in logical plan optimizations like projection pushdown, which should improve performance. The implementation is thorough, touching all layers of the expression system from definition to evaluation and optimization. The addition of comprehensive tests, especially the parametrized tests covering various fusion scenarios, gives high confidence in the correctness of this complex change.
| if e.name not in downstream_input_column_rename_map: | ||
| projected_upstream_output_col_exprs.append(e) | ||
| # Get DropExpr from downstream (before translation) | ||
| downstream_drop_exprs_pre_translation = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pre_renaming
Signed-off-by: Goutam <[email protected]>
Signed-off-by: Goutam <[email protected]>
Signed-off-by: Goutam <[email protected]>
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request refactors drop_columns to use an expression-based approach instead of map_batches. This is a significant architectural improvement that enables projection pushdown optimizations for drop_columns. The changes include introducing a new DropExpr, updating the projection fusion and pushdown logic to correctly handle this new expression, and adding comprehensive tests to ensure correctness across various scenarios. The implementation is solid and the new tests are thorough. A potential serialization issue in the Mongo datasource is also fixed. Overall, this is a high-quality contribution that improves the performance and consistency of the Data API.
Signed-off-by: Goutam <[email protected]>
Signed-off-by: Goutam <[email protected]>
Signed-off-by: Goutam <[email protected]>
Signed-off-by: Goutam <[email protected]>
Signed-off-by: Goutam <[email protected]>
aslonnie
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does not seem to require CI review?
|
This pull request has been automatically marked as stale because it has not had You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed. |
Description
Right now
drop_columnsusesmap_batchesinstead of an expression thereby preventing any pushdown optimizations.Following changes were made:
DropExprin Expressions[star(), drop(cols)]indrop_columnsNonebecause of operator fusionBefore: drop_columns used map_batches → no fusion with Read → different operator structure
After: drop_columns uses expressions → CAN fuse with Read → the datasource needs to be serializable earlier/more frequently
Related issues
Additional information