Support for Compute Functions on Nested Arrays #12553

@madhavajay

Hi,
I have read through the docs and issues as best I can, and I am under the impression that it's not possible to run compute functions on nested arrays.

I modified a group_by & aggregate example like so, putting the pa.array values into nested lists.

import pyarrow as pa

t = pa.table([
    pa.array(["a", "a", "b", "b", "c"]),
    pa.array([[1], [2], [3], [4], [5]]),
], names=["keys", "values"])

t.group_by("keys").aggregate([("values", "sum")])

The error is this:

ArrowNotImplementedError: Function 'hash_sum' has no kernel matching input types (array[list<item: int64>], array[uint32])

I assume this means the function doesn't know how to operate on a list. Is there a way to do this? I have large tensors which I can reshape into one dimension to store in a Record Batch, but I don't know how to perform computations on their values. The other option seems to be the Tensor type, but as far as I can tell it can't be used in a Record Batch or with compute functions, can it?

PyArrow's zero-copy conversion from NumPy makes this an effective way to get data across the network using the IPC writer, and it's fairly easy to add other record types for custom metadata. It would be a pity to then have to send this data back to NumPy for all my computations and lose out on all that great SIMD parallelization.

Is there a better way?

Related links:
#4802
https://issues.apache.org/jira/browse/ARROW-1614
