Support for Compute Functions on Nested Arrays

Hi,
I have read through the docs and issues as best I can, and I am under the impression that it's not possible to run compute functions on nested arrays.

I modified a group_by & aggregate example like so, putting the pa.array values into nested lists.
```
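import pyarrow as pa

# Same as the docs' group_by example, but each value is wrapped in a list
t = pa.table([
      pa.array(["a", "a", "b", "b", "c"]),
      pa.array([[1], [2], [3], [4], [5]]),
], names=["keys", "values"])

# Raises ArrowNotImplementedError
t.group_by("keys").aggregate([("values", "sum")])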
```

The error is this:
```
ArrowNotImplementedError: Function 'hash_sum' has no kernel matching input types (array[list<item: int64>], array[uint32])
```

I assume this means the function doesn't know how to operate on a list. Is there a way to do this? I have large tensors which I can reshape into one dimension to store in a Record Batch, but I don't know how to perform computations on their values. The other option seems to be the Tensor type, but it can't be used in a Record Batch or with compute functions, can it?

PyArrow's zero-copy conversion from NumPy makes this an effective way to get data across the network using the IPC writer, and it's fairly easy to add other record types for custom metadata. It would be a pity, though, to have to send this data back to NumPy for all my computations and lose out on all that great SIMD parallelization.

Is there a better way?

Related links:
https://github.com/apache/arrow/issues/4802
https://issues.apache.org/jira/browse/ARROW-1614

Support for Compute Functions on Nested Arrays #12553

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Support for Compute Functions on Nested Arrays #12553

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions