[Data] Avoid unnecessary conversion to Numpy when creating Arrow/Pandas blocks #51238

alexeykudinkin · 2025-03-11T05:16:59Z

Why are these changes needed?

Context

This change removes the blanket conversion to NumPy that was applied to every chunk of data before converting it to PyArrow.

That blanket conversion causes problems when batches contain Arrow-native Scalars: because of it, they ultimately end up serialized using the ArrowPythonObjectType extension.

Changes

We revisit the conversion logic and now convert passed-in column values to NumPy only in the following cases:

- The column name is TENSOR_COLUMN_NAME (for compatibility).
- The provided column values are already represented by a tensor (NumPy, Torch, etc.).
- The provided column values are a list of ndarrays (kept for compatibility with the previous behavior, where all column values were blindly converted to NumPy, which turned a list of ndarrays into a tensor).
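
For readers skimming the description, the three cases above can be sketched roughly as follows. This is an illustration only, not the code from this PR: the helper names `is_tensor_like` and `should_convert_to_numpy` are hypothetical, the value assigned to `TENSOR_COLUMN_NAME` is a placeholder (Ray defines the real constant), and the Torch check is duck-typed so the sketch runs without Torch installed.

```python
import numpy as np

# Placeholder value for illustration; the real constant is defined in Ray.
TENSOR_COLUMN_NAME = "tensor_column_placeholder"


def is_tensor_like(values) -> bool:
    """Hypothetical helper: True for ndarrays and tensor-like objects."""
    if isinstance(values, np.ndarray):
        return True
    # Duck-typed stand-in for framework tensors (e.g. torch.Tensor exposes
    # both `.numpy()` and `.shape`); an assumption made for this sketch.
    return hasattr(values, "numpy") and hasattr(values, "shape")


def should_convert_to_numpy(column_name: str, values) -> bool:
    """Hypothetical helper mirroring the three cases listed above."""
    # Case 1: the dedicated tensor column is always converted (compatibility).
    if column_name == TENSOR_COLUMN_NAME:
        return True
    # Case 2: values are already a tensor.
    if is_tensor_like(values):
        return True
    # Case 3: a (non-empty) list made up entirely of ndarrays.
    if isinstance(values, list) and len(values) > 0:
        return all(isinstance(v, np.ndarray) for v in values)
    # Everything else (plain Python values, Arrow scalars, ...) is left
    # untouched so it can be handed to Arrow/Pandas directly.
    return False
```

Under these assumptions, a plain list such as `[1, 2, 3]` in an ordinary column is passed through unchanged, while `[np.zeros(2), np.ones(2)]` is still converted.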

Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

raulchen · 2025-03-11T18:23:15Z

lint is failing. please fix

Signed-off-by: Alexey Kudinkin <[email protected]>


…as blocks (ray-project#51238)

Context
---
This change removes the blanket conversion to NumPy that was applied to every chunk of data before converting it to PyArrow. That blanket conversion causes problems when batches contain Arrow-native `Scalars`: because of it, they ultimately end up serialized using the `ArrowPythonObjectType` extension.

Changes
---
We revisit the conversion logic and now convert passed-in column values to NumPy only in the following cases:
- The column name is `TENSOR_COLUMN_NAME` (for compatibility).
- The provided column values are already represented by a tensor (NumPy, Torch, etc.).
- The provided column values are a list of ndarrays (kept for compatibility with the previous behavior, where all column values were blindly converted to NumPy, which turned a list of ndarrays into a tensor).

---------

Signed-off-by: Alexey Kudinkin <[email protected]>
Signed-off-by: Dhakshin Suriakannu <[email protected]>

alexeykudinkin requested a review from a team as a code owner March 11, 2025 05:17

alexeykudinkin requested a review from raulchen March 11, 2025 05:17

alexeykudinkin added the "go" label (add ONLY when ready to merge, run all tests) Mar 11, 2025

raulchen approved these changes Mar 11, 2025


alexeykudinkin force-pushed the ak/arw-cnv-opt-fix branch from 181c205 to 705f8a1 March 12, 2025 01:04

alexeykudinkin added 3 commits March 12, 2025 11:40

Avoiding unnecessary Numpy conversion when creating Arrow blocks

61dc366

Signed-off-by: Alexey Kudinkin <[email protected]>

Avoiding unnecessary Numpy conversion when creating Pandas blocks

4846eb7

Signed-off-by: Alexey Kudinkin <[email protected]>

Missing API annotations

f92c9fb

Signed-off-by: Alexey Kudinkin <[email protected]>

alexeykudinkin force-pushed the ak/arw-cnv-opt-fix branch from ccc4b47 to f92c9fb March 12, 2025 18:40

bveeramani merged commit f6347c0 into master Mar 12, 2025
5 checks passed

bveeramani deleted the ak/arw-cnv-opt-fix branch March 12, 2025 22:14

hainesmichaelc added the community-backlog label May 22, 2025
