-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-46407: [C++] Fix IPC serialization of sliced list arrays #46408
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
|
OK, I actually broke FeatherTests/TestFeather.SliceStringsRoundTrip... Please don't review this yet :-/ |
|
I got a working fix -- we still need to look at the top-level offset() if the sliced array starts with zeroes. |
6065d9c to
8b28c53
Compare
|
I have trimmed down the pull request to being just the fix, and nothing else. I'll likely do the cleanups as a follow-up. |
|
@brunal That's an interesting finding and it's doubly interesting that we hadn't noticed it before. Let's see how CI fares. |
|
Since the scope of the patch is expanding, I took the liberty of performing a cleanup of the function in the last commit. Let me know if you'd prefer it to happen separately, and I'll revert it. |
|
@brunal The additional changes look fine to me. I'll take another look at the PR today. |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An updated review. The fix looks fine, the tests could maybe be improved a bit, see below.
|
For the additional test: tried with but I'm getting As I can't seem to find a spec about feather formats I'm not sure where to go from there. |
"Feather" is just an old synonym for the IPC file format. I think you could just skip the test on Feather "V1". |
…offsets[0] > 0. Arrow C++ slices arrays by bumping the top-level `offset` value. However, Arrow Rust slices list arrays by slicing the `value_offsets` buffer. When receiving a Rust Arrow Array in C++ (via the C data interface), its IPC serialization fails to notice that the `value_offsets` buffer needed to be updated, but it still updates the `values` buffer. This leads to a corrupt array on deserialization, with an `value_offsets` buffer that points past the end of the values array. This commit fixes the IPC serialization by also looking at value_offset(0) to determine whether the `value_offsets` buffer needs reconstructing, instead of just looking at offset(). Additionally, this commit updates the comment surrounding the logic, as it had 2 issues: 1. It hints that offset > 0 and value_offsets[0] > 0 happen together, when they actually tend to be exclusive (... unless you slice twice, once in Rust and once in C++). 2. It mentions slicing the values, when that does not happen in the function where the comment appears (GetZeroBasedValueOffsets), but at call site (Visit(Array)). Notes: * I'm surprised the ListViewArray does not have this bug. The code it uses is slightly different. I did not dig into its precise behaviour. * The function could use a cleanup. There is no need for the `offset` symbol, which triggers a copy of the shared_ptr of the offsets buffer for nothing.
…fset is > 0. Instead, just slice the offsets buffer. This plays nicely with the truncation logic that already exists below.
* Extend the loop to cover `i == array.length()` instead of doing it
with a dedicated statement.
+ assign the source offsets at once, instead of callling a method on
every loop body.
* Avoid copying the `offsets_buffer` shared_ptr (i.e. avoid refcount++
into refcount--).
+ the function does not rely anymore on complex rules of type
deduction for `auto` variables (dropping reference and
cv-qualification).
* Early return when array.length() == 0.
20a0072 to
6d9393d
Compare
Sorry, I got this slightly wrong. Feather V2 is a synonym for the IPC file format, but Feather V1 was a different (and obsolete) format. I fixed the comment and made it a GTest skip reason instead. |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. I will wait for CI and will merge if green. Thank you very much @brunal for finding and fixing this critical bug.
|
CI failures are unrelated. |
PR apache#46408 changed by mistake list-view IPC tests to use the same data as list tests. This was detected as a duplicate corpus file by the OSS-Fuzz CI build. This PR also includes a fix for a regression in the CUDA tests, due to reading non-CPU memory.
### Rationale for this change PR #46408 included a typo that changed list-view IPC tests to use the same data as list tests. This was detected as a duplicate corpus file by the OSS-Fuzz CI build. ### What changes are included in this PR? Undo mistake that led to using the same test data for lists and list-views. Also fix a regression in the CUDA tests, due to reading non-CPU memory when fetching the first offset in a list/binary array. ### Are these changes tested? Yes. ### Are there any user-facing changes? No. * GitHub Issue: #46704 Authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
### Rationale for this change PR apache#46408 included a typo that changed list-view IPC tests to use the same data as list tests. This was detected as a duplicate corpus file by the OSS-Fuzz CI build. ### What changes are included in this PR? Undo mistake that led to using the same test data for lists and list-views. Also fix a regression in the CUDA tests, due to reading non-CPU memory when fetching the first offset in a list/binary array. ### Are these changes tested? Yes. ### Are there any user-facing changes? No. * GitHub Issue: apache#46704 Authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
|
After merging your PR, Conbench analyzed the 0 benchmarking runs that have been run so far on merge-commit 0e85c12. None of the specified runs were found on the Conbench server. The full Conbench report has more details. |
Rationale for this change
Arrow C++ slices arrays by bumping the top-level
offsetvalue.However, Arrow Rust slices list arrays by slicing the
value_offsetsbuffer. When receiving a Rust Arrow Array in C++ (via the C data
interface), its IPC serialization fails to notice that the
value_offsetsbuffer needed to be updated, but it still updates thevaluesbuffer. This leads to a corrupt array on deserialization, withan
value_offsetsbuffer that points past the end of the values array.This PR fixes the IPC serialization by also looking at value_offset(0) to
determine whether the
value_offsetsbuffer needs reconstructing,instead of only looking at offset().
This works because value_offset(int) is the offets buffer, shifted by the top-level offset.
We still need to check for offset(), to account for array starting with an empty list (multiple
zeroes at the start of the offsets buffer).
What changes are included in this PR?
The fix and nothing else
Are these changes tested?
Yes
Are there any user-facing changes?
No (well, unless they are affected by the bug)
This PR contains a "Critical Fix". (the changes fix (b) a bug that caused incorrect or invalid data to be produced) : valid operations on valid data produce invalid data.