[data] fix repartitioning empty datasets #54107
Signed-off-by: Hao Chen <[email protected]>
Pull Request Overview
This PR fixes the repartitioning behavior for empty datasets and refines the handling of block schemas during repartitioning.
- Added a new test to ensure empty datasets are repartitioned correctly with both shuffling options.
- Updated the block schema handling in the repartition task scheduler, including a default fallback for empty results.
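The "default fallback for empty results" can be illustrated with a minimal sketch. This is not the code from the PR: `BlockMeta` and `first_block_schema` are hypothetical names introduced here, and the assumption is only that the scheduler reads the schema from the first entry of a metadata list that may be empty.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class BlockMeta:
    """Hypothetical stand-in for per-block metadata; not Ray's actual class."""

    num_rows: int
    schema: Optional[Any] = None


def first_block_schema(
    metadata: List[BlockMeta], default: Optional[Any] = None
) -> Optional[Any]:
    """Return the schema of the first block, or `default` if there are no blocks."""
    if not metadata:
        # An empty dataset can produce no block metadata at all, in which case
        # indexing metadata[0] would raise IndexError.
        return default
    return metadata[0].schema


if __name__ == "__main__":
    # Empty input falls back to the default instead of raising.
    print(first_block_schema([]))
    # Non-empty input behaves exactly like metadata[0].schema.
    print(first_block_schema([BlockMeta(num_rows=3, schema="id: int64")]))
```

Checking for an empty list before indexing keeps the non-empty path unchanged and lets the caller decide which schema, if any, an empty result should carry.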
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| python/ray/data/tests/test_repartition_e2e.py | Added test case for empty dataset repartitioning with both shuffled and non-shuffled options. |
| python/ray/data/_internal/planner/exchange/split_repartition_task_scheduler.py | Refined block schema detection logic and improved error messaging for unknown block schemas. |
Comments suppressed due to low confidence (1)
python/ray/data/tests/test_repartition_e2e.py:210
- [nitpick] Consider adding a docstring to this test function to clarify its purpose and expected behavior when repartitioning an empty dataset.
def test_repartition_empty_datasets(ray_start_regular_shared_2_cpus, shuffle):
python/ray/data/_internal/planner/exchange/split_repartition_task_scheduler.py
num_partitions = 5
ds_empty = ray.data.range(100).filter(lambda row: False)
Nit: Consider using the simplest inputs to make the intent clearer
Suggested change:
- num_partitions = 5
- ds_empty = ray.data.range(100).filter(lambda row: False)
+ num_partitions = 1
+ ds_empty = ray.data.range(1).filter(lambda row: False)
Fix the following error when repartitioning an empty dataset:
```
first_block_schema = reduce_metadata_schema[0].schema
IndexError: list index out of range
```
Signed-off-by: Hao Chen <[email protected]>
Fix the following error when repartitioning an empty dataset:
```
first_block_schema = reduce_metadata_schema[0].schema
IndexError: list index out of range
```
Signed-off-by: Hao Chen <[email protected]>
Signed-off-by: elliot-barn <[email protected]>