[C++] Rust sliced ListArrays get corrupted by C++ IPC serialization #46407

@brunal

Describe the bug, including details regarding any error messages, version, and platform.

Steps to reproduce:

  • Create a ListArray in Rust
  • Slice it (at index > 0)
  • Send it to C++ via the C data interface
  • Perform IPC serialization of the array (wrapped in a RecordBatch)
  • Deserializing the resulting message, in C++ or Rust, produces invalid data: its offsets buffer points past the end of its child data.

Here is a standalone Python reproduction:

import pyarrow as pa

# This ListArray represents [[3, 4, 5]]. It was sliced the way Rust slices
# ListArrays.
# The C++ slicing would have resulted in offsets_buffer = [0, 2, 5] and
# top-level offset = 1.
list_array = pa.ListArray.from_arrays(offsets=pa.array([2, 5]), values=[1, 2, 3, 4, 5])
list_array.validate()
assert list_array == pa.array([[3, 4, 5]])

table = pa.table({"col": list_array})
sink = pa.BufferOutputStream()
pa.ipc.new_stream(sink, table.schema).write_table(table)

reader = pa.ipc.RecordBatchStreamReader(sink.getvalue())
table_deserialized = pa.Table.from_batches(list(reader))

# This raises pyarrow.lib.ArrowInvalid: In chunk 0: Invalid: First or last list offset out of bounds
table_deserialized.column(0).validate()

The gist of the issue is that:

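  • Rust and C++ slice ListArrays differently
  • C++ bumps the top-level offset of the ArrayData and leaves the offsets buffer untouched
  • Rust does not maintain a top-level offset. Instead, it slices the offsets buffer, so the first offset may be non-zero
  • Upon IPC serialization of a ListArray, C++ only looks at the top-level offset to decide whether to rebuild the offsets buffer. The child data, however, is always trimmed to the range the offsets actually reference
  • For a Rust-sliced array, the offsets are therefore written unchanged while the child data is trimmed, which leads to a corrupt serialized message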
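
The mismatch described above can be modelled in plain Python, without pyarrow. This is only an illustrative sketch under stated assumptions: the function names (`decode`, `serialize_buggy`, `serialize_rebased`, `offsets_in_bounds`) are hypothetical and do not correspond to any Arrow API, and the "buggy" path simply mirrors the behaviour described in this issue (offsets kept as-is when the top-level offset is 0, child values trimmed).

```python
def decode(offsets, values, top_offset=0):
    """Decode a list array given its offsets buffer, child values and top-level offset."""
    n_lists = len(offsets) - 1 - top_offset
    return [
        values[offsets[i]:offsets[i + 1]]
        for i in range(top_offset, top_offset + n_lists)
    ]


def serialize_buggy(offsets, values, top_offset=0):
    """Model of the reported behaviour: offsets are rebuilt only if top_offset != 0."""
    visible = offsets[top_offset:]
    first, last = visible[0], visible[-1]
    if top_offset != 0:
        out_offsets = [o - first for o in visible]
    else:
        out_offsets = list(visible)  # left untouched, even if visible[0] != 0
    return out_offsets, values[first:last]  # child values are always trimmed


def serialize_rebased(offsets, values, top_offset=0):
    """Model of a fix: always rebase offsets so that they start at 0."""
    visible = offsets[top_offset:]
    first, last = visible[0], visible[-1]
    return [o - first for o in visible], values[first:last]


def offsets_in_bounds(offsets, values):
    """Minimal stand-in for the 'first or last list offset out of bounds' check."""
    return 0 <= offsets[0] and offsets[-1] <= len(values)


values = [1, 2, 3, 4, 5]

# C++-style slice: full offsets buffer plus a top-level offset of 1.
cpp_style = ([0, 2, 5], values, 1)
# Rust-style slice: sliced offsets buffer, top-level offset of 0.
rust_style = ([2, 5], values, 0)

# Both representations describe the same logical array.
assert decode(*cpp_style) == decode(*rust_style) == [[3, 4, 5]]

# The C++-style slice survives the modelled serialization...
assert offsets_in_bounds(*serialize_buggy(*cpp_style))
# ...but the Rust-style slice ends up with offsets [2, 5] over 3 child values.
assert not offsets_in_bounds(*serialize_buggy(*rust_style))
# Rebasing unconditionally yields a valid result for both.
assert serialize_rebased(*rust_style) == ([0, 3], [3, 4, 5])
```

The sketch rebases whenever the first visible offset is used as the origin, regardless of the top-level offset; whether the actual fix takes this exact form is not stated here.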

I have a test+fix for this.

Component(s)

C++
