@brunal
Contributor

@brunal brunal commented May 12, 2025

Rationale for this change

Arrow C++ slices arrays by bumping the top-level offset value.
However, Arrow Rust slices list arrays by slicing the value_offsets
buffer. When C++ receives a Rust Arrow array (via the C data
interface), the C++ IPC serialization fails to notice that the
value_offsets buffer needs to be updated, but it still updates the
values buffer. This leads to a corrupt array on deserialization, with
a value_offsets buffer that points past the end of the values array.

This PR fixes the IPC serialization by also looking at value_offset(0) to
determine whether the value_offsets buffer needs reconstructing,
instead of only looking at offset().
This works because value_offset(i) reads the offsets buffer at an index shifted by the top-level offset.
We still need to check offset(), to account for arrays that start with empty lists (multiple
zeroes at the start of the offsets buffer).
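
To make the two slicing conventions concrete, here is a minimal, self-contained C++ sketch. It deliberately does not use the Arrow API: `ListArrayModel`, `SliceByOffset`, `SliceByOffsetsBuffer`, `NeedsRebaseOld` and `NeedsRebase` are hypothetical names invented for illustration. The model assumes 32-bit offsets and an offsets buffer holding at least offset + length + 1 entries.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a list array (NOT the Arrow C++ class).
struct ListArrayModel {
  std::vector<int32_t> raw_offsets;  // the value_offsets buffer
  int64_t offset = 0;                // top-level array offset
  int64_t length = 0;                // number of lists

  // Same idea as value_offset(i): read the buffer shifted by the top-level offset.
  int32_t value_offset(int64_t i) const {
    assert(offset + i >= 0 &&
           offset + i < static_cast<int64_t>(raw_offsets.size()));
    return raw_offsets[static_cast<std::size_t>(offset + i)];
  }
};

// C++-style slice: keep the buffer, bump the top-level offset.
inline ListArrayModel SliceByOffset(const ListArrayModel& a, int64_t start,
                                    int64_t len) {
  ListArrayModel out = a;
  out.offset = a.offset + start;
  out.length = len;
  return out;
}

// Rust-style slice: slice the offsets buffer itself, top-level offset stays 0.
inline ListArrayModel SliceByOffsetsBuffer(const ListArrayModel& a,
                                           int64_t start, int64_t len) {
  ListArrayModel out;
  auto first =
      a.raw_offsets.begin() + static_cast<std::ptrdiff_t>(a.offset + start);
  out.raw_offsets.assign(first, first + static_cast<std::ptrdiff_t>(len + 1));
  out.length = len;
  return out;
}

// Check before the fix (as described above): only the top-level offset.
inline bool NeedsRebaseOld(const ListArrayModel& a) { return a.offset != 0; }

// Check after the fix: also look at the first logical offset.
inline bool NeedsRebase(const ListArrayModel& a) {
  return a.offset != 0 || a.value_offset(0) != 0;
}
```

With offsets {0, 2, 3, 6}, dropping the first list Rust-style yields the buffer {2, 3, 6} with offset 0, so `NeedsRebaseOld` returns false while `NeedsRebase` returns true. With offsets {0, 0, 0, 1}, a C++-style slice starting at index 1 has value_offset(0) == 0 but offset == 1, which is the case that keeps the offset() check necessary.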

What changes are included in this PR?

The fix and nothing else

Are these changes tested?

Yes

Are there any user-facing changes?

No (well, unless they are affected by the bug)

This PR contains a "Critical Fix": the changes fix a bug that caused incorrect or invalid data to be produced (valid operations on valid data produce invalid data).

@github-actions

⚠️ GitHub issue #46407 has been automatically assigned in GitHub to PR creator.

@kou kou changed the title GH-46407: fix IPC serialization of sliced list arrays GH-46407: [C++] fix IPC serialization of sliced list arrays May 13, 2025
@brunal
Contributor Author

brunal commented May 13, 2025

OK, I actually broke FeatherTests/TestFeather.SliceStringsRoundTrip... Please don't review this yet :-/

@brunal
Contributor Author

brunal commented May 13, 2025

I got a working fix -- we still need to look at the top-level offset() if the sliced array starts with zeroes.

@brunal brunal force-pushed the rust-slice-list-array-ipc2 branch from 6065d9c to 8b28c53 Compare May 14, 2025 17:39
@brunal
Contributor Author

brunal commented May 14, 2025

I have trimmed the pull request down to just the fix, and nothing else. I'll likely do the cleanups as a follow-up.

@pitrou
Member

pitrou commented May 19, 2025

@brunal That's an interesting finding and it's doubly interesting that we hadn't noticed it before. Let's see how CI fares.

@pitrou
Member

pitrou commented May 19, 2025

The CI failures are unrelated: they are tracked in #46498 and #46343.

@github-actions github-actions bot added the awaiting committer review label and removed the awaiting review label May 19, 2025
@brunal
Contributor Author

brunal commented May 21, 2025

Since the scope of the patch is expanding, I took the liberty of performing a cleanup of the function in the last commit. Let me know if you'd prefer it to happen separately, and I'll revert it.

@pitrou pitrou changed the title GH-46407: [C++] fix IPC serialization of sliced list arrays GH-46407: [C++] Fix IPC serialization of sliced list arrays May 21, 2025
@pitrou
Member

pitrou commented May 21, 2025

@brunal The additional changes look fine to me. I'll take another look at the PR today.

Member

@pitrou pitrou left a comment

An updated review. The fix looks fine, the tests could maybe be improved a bit, see below.

@brunal
Contributor Author

brunal commented May 27, 2025

For the additional test, I tried:

TEST_P(TestFeather, SliceListsRoundTrip) {
  std::shared_ptr<RecordBatch> batch;
  ASSERT_OK(ipc::test::MakeListRecordBatch(&batch));
  CheckSlices(batch);
}

but I'm getting

Failed
'WriteTable(table, stream_.get(), GetProperties())' failed with Type error: Unsupported Feather V1 type: list_view<item: int32>. Use V2 format to serialize all Arrow types.
/Users/cauet/arrow/cpp/src/arrow/ipc/feather.cc:602  ToFlatbufferType(*values.type())
/Users/cauet/arrow/cpp/src/arrow/ipc/feather.cc:639  WriteArrayV1(chunk, dst, &out->values)
/Users/cauet/arrow/cpp/src/arrow/ipc/feather.cc:679  WriteColumnV1(*table.column(i), dst, &col)

As I can't seem to find a spec about feather formats I'm not sure where to go from there.

@pitrou
Member

pitrou commented May 27, 2025

> As I can't seem to find a spec about feather formats I'm not sure where to go from there.

"Feather" is just an old synonym for the IPC file format. I think you could just skip the test on Feather "V1".

@brunal brunal requested a review from pitrou May 30, 2025 11:09
brunal and others added 6 commits June 3, 2025 17:24
…offsets[0] > 0.

Arrow C++ slices arrays by bumping the top-level `offset` value.
However, Arrow Rust slices list arrays by slicing the `value_offsets`
buffer. When receiving a Rust Arrow Array in C++ (via the C data
interface), its IPC serialization fails to notice that the
`value_offsets` buffer needed to be updated, but it still updates the
`values` buffer.  This leads to a corrupt array on deserialization, with
a `value_offsets` buffer that points past the end of the values array.

This commit fixes the IPC serialization by also looking at
value_offset(0) to determine whether the `value_offsets` buffer needs
reconstructing, instead of just looking at offset().

Additionally, this commit updates the comment surrounding the logic, as
it had 2 issues:
1. It hints that offset > 0 and value_offsets[0] > 0 happen together,
   when they actually tend to be exclusive (... unless you slice twice,
   once in Rust and once in C++).
2. It mentions slicing the values, when that does not happen in the
   function where the comment appears (GetZeroBasedValueOffsets), but at
   the call site (Visit(Array)).

Notes:
* I'm surprised the ListViewArray does not have this bug. The code it
  uses is slightly different. I did not dig into its precise behaviour.
* The function could use a cleanup. There is no need for the `offset`
  symbol, which triggers a copy of the shared_ptr of the offsets buffer
  for nothing.
…fset is > 0.

Instead, just slice the offsets buffer. This plays nicely with the
truncation logic that already exists below.
* Extend the loop to cover `i == array.length()` instead of doing it
  with a dedicated statement.
  + assign the source offsets at once, instead of calling a method on
    every loop iteration.

* Avoid copying the `offsets_buffer` shared_ptr (i.e. avoid a refcount++
  followed by a refcount--).
  + the function no longer relies on complex rules of type
    deduction for `auto` variables (dropping reference and
    cv-qualification).

* Early return when array.length() == 0.
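
The cleanup bullets above can be sketched in plain C++ as follows. This is an illustrative model under stated assumptions, not the Arrow function: `RebaseOffsets` is a hypothetical name, offsets are assumed to be 32-bit, the buffer is assumed to hold at least offset + length + 1 entries, and returning a single zero for an empty array is this sketch's own choice.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (NOT the Arrow implementation): produce zero-based
// offsets for the `length` lists that start at index `offset` in `raw_offsets`.
inline std::vector<int32_t> RebaseOffsets(const std::vector<int32_t>& raw_offsets,
                                          int64_t offset, int64_t length) {
  // Early return for an empty array; the {0} result is an assumption here.
  if (length == 0) return {0};

  assert(offset >= 0 && length > 0);
  assert(offset + length < static_cast<int64_t>(raw_offsets.size()));

  // Take the source pointer once instead of calling an accessor per iteration.
  const int32_t* src = raw_offsets.data() + offset;
  const int32_t start = src[0];

  std::vector<int32_t> out(static_cast<std::size_t>(length + 1));
  // A single loop also covers i == length, so no dedicated trailing statement.
  for (int64_t i = 0; i <= length; ++i) {
    out[static_cast<std::size_t>(i)] = src[i] - start;
  }
  return out;
}
```

Both slicing conventions give the same result in this model: `RebaseOffsets({0, 2, 3, 6}, 1, 2)` (top-level offset bumped) and `RebaseOffsets({2, 3, 6}, 0, 2)` (offsets buffer sliced) each return {0, 1, 4}.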
@pitrou pitrou force-pushed the rust-slice-list-array-ipc2 branch from 20a0072 to 6d9393d Compare June 3, 2025 15:28
@pitrou
Member

pitrou commented Jun 3, 2025

> "Feather" is just an old synonym for the IPC file format. I think you could just skip the test on Feather "V1".

Sorry, I got this slightly wrong. Feather V2 is a synonym for the IPC file format, but Feather V1 was a different (and obsolete) format. I fixed the comment and made it a GTest skip reason instead.

Member

@pitrou pitrou left a comment

This looks good to me. I will wait for CI and will merge if green. Thank you very much @brunal for finding and fixing this critical bug.

@pitrou
Member

pitrou commented Jun 3, 2025

CI failures are unrelated.

@pitrou pitrou merged commit 0e85c12 into apache:main Jun 3, 2025
49 of 60 checks passed
@pitrou pitrou removed the awaiting committer review label Jun 3, 2025
pitrou added a commit to pitrou/arrow that referenced this pull request Jun 4, 2025
PR apache#46408 mistakenly changed the list-view IPC tests to use the same data as the list tests.
This was detected as a duplicate corpus file by the OSS-Fuzz CI build.

This PR also includes a fix for a regression in the CUDA tests, due to reading non-CPU memory.
pitrou added a commit that referenced this pull request Jun 9, 2025
### Rationale for this change

PR #46408 included a typo that changed list-view IPC tests to use the same data as list tests. This was detected as a duplicate corpus file by the OSS-Fuzz CI build.

### What changes are included in this PR?

Undo mistake that led to using the same test data for lists and list-views. Also fix a regression in the CUDA tests, due to reading non-CPU memory when fetching the first offset in a list/binary array.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.

* GitHub Issue: #46704

Authored-by: Antoine Pitrou
Signed-off-by: Antoine Pitrou
alinaliBQ pushed a commit to Bit-Quill/arrow that referenced this pull request Jun 17, 2025
@conbench-apache-arrow

After merging your PR, Conbench analyzed the 0 benchmarking runs that have been run so far on merge-commit 0e85c12.

None of the specified runs were found on the Conbench server.

The full Conbench report has more details.
