-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded #34408
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
jorisvandenbossche
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The skeleton looks good to me
|
@jorisvandenbossche @westonpace @zeroshade this now has the implementation of all the new functions, changes in behavior, and tests. Ready for review. |
westonpace
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few minor nits but this looks pretty well thought out
|
@westonpace I'm a bit lost on how these labels change and what they actully mean, but the PR is ready for review again. |
|
@ursabot please benchmark |
|
Benchmark runs are scheduled for baseline = fc95019 and contender = ceacb2ef16a455cf73d6ad31d82928f185646597. Results will be available as each benchmark for each run completes. |
westonpace
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more questions. I also (hopefully) triggered conbench just to double check impact (though I don't know if we have any null-specific benchmarks).
cpp/src/arrow/array/array_test.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have existing tests like this for sparse and dense union? (given we now have distinct paths for those)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mentioned you in PR comments pointing to code that tests IsNull/Valid for unions.
cpp/src/arrow/array/data.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What prevents these functions from being .cc only function in an anonymous namespace? It seems like the only need is because we have functions later in data.h that could be in a .cc file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are called from IsNull and IsValid which are inline and having the switch on the type ID inlined allows the compiler to inline the highly predictable branches into the loop bodies from which these functions are called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could introduce an IsValidFallback function that handles the case when the validity buffer is NULLPTR, then all these could become .cc only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's ok how it is. I just don't ever have a good sense for when it's justified for leaving something in the header file for performance reasons. Typically I think it would be nice for there to be some benchmark to point to. Otherwise "it seems like this is important for performance" becomes too subjective of a criteria to apply.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The existing benchmarks indicated some regression (maybe due to union arrays being part of IsNull benchmarks) so I will investigate more carefully.
- Add REE to TestNullCount tests - Test new null handling functions in TestMakeArrayOfNull
|
Benchmark runs are scheduled for baseline = 84e5430 and contender = fde31ed. fde31ed is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
| const int8_t child_id = sparse_union_type->child_ids()[types[span.offset + i]]; | ||
|
|
||
| null_count += span.child_data[child_id].IsNull(i); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you test this with a sliced array? (don't directly see it in the tests)
Because in #35036, I based my implementation on this function, and I am finding that I need to remove/add the offset like so:
| const int8_t child_id = sparse_union_type->child_ids()[types[span.offset + i]]; | |
| null_count += span.child_data[child_id].IsNull(i); | |
| const int8_t child_id = sparse_union_type->child_ids()[types[i]]; | |
| null_count += span.child_data[child_id].IsNull(span.offset + i); |
So types, which is the result of Span.GetValues() already takes into account the span's offset, but IsNull() is called on the child_data, and accessing child_data gives the original array data without offset already taken into account.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a high chance the existing tests didn't cover sliced union arrays and an even higher chance I messed up here because I had just read the Arrow spec on Unions to fix this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not so much about the Arrow spec on Unions, but rather about the small details of our APIs (Span::GetValues returning sliced values or not, Span::child_data returning sliced data or not?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GetValues applies slices. Direct access doesn't. I think that makes sense, but it's hard to keep in mind at all times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
child_ids() is unclear to me because it's a cache if I recall correctly.
…rvers Now that all the benchmark builds are running again. :fingers-crossed: This reverts commit fde31ed.
…hout bitmaps like Unions and Run-End Encoded (apache#34408) Bonus: add `ArrayData::IsValid()` to make it consistent with `Array` and `ArraySpan`. ### Rationale for this change This is the proposed fix to apache#34361 plus the addition of more APIs dealing with validity/nullity. ### What changes are included in this PR? This PR changes the behavior of `IsNull` and `IsValid` in `Array`, `ArrayData`, and `ArraySpan`. It preserves the behavior of `MayHaveNulls`, `GetNullCount` and introduces new member functions to `Array`, `ArrayData`, and `ArraySpan`: - `bool HasValidityBitmap() const` - `bool MayHaveLogicalNulls() const` - `int64_t ComputeLogicalNullCount() const` ### Are these changes tested? Yes. ### Are there any user-facing changes? Yes. See above. Breakage with these changes can only happen if users rely on `IsNull(i)` always returning `true` for union types, but we have users reporting that the current behavior or broken apache#34315. This is why the behavior of `IsNull` and `IsValid` is changing., **This PR contains a "Critical Fix".** * Closes: apache#34361 Authored-by: Felipe Oliveira Carvalho <[email protected]> Signed-off-by: Matt Topol <[email protected]>
…hout bitmaps like Unions and Run-End Encoded (apache#34408) Bonus: add `ArrayData::IsValid()` to make it consistent with `Array` and `ArraySpan`. ### Rationale for this change This is the proposed fix to apache#34361 plus the addition of more APIs dealing with validity/nullity. ### What changes are included in this PR? This PR changes the behavior of `IsNull` and `IsValid` in `Array`, `ArrayData`, and `ArraySpan`. It preserves the behavior of `MayHaveNulls`, `GetNullCount` and introduces new member functions to `Array`, `ArrayData`, and `ArraySpan`: - `bool HasValidityBitmap() const` - `bool MayHaveLogicalNulls() const` - `int64_t ComputeLogicalNullCount() const` ### Are these changes tested? Yes. ### Are there any user-facing changes? Yes. See above. Breakage with these changes can only happen if users rely on `IsNull(i)` always returning `true` for union types, but we have users reporting that the current behavior or broken apache#34315. This is why the behavior of `IsNull` and `IsValid` is changing., **This PR contains a "Critical Fix".** * Closes: apache#34361 Authored-by: Felipe Oliveira Carvalho <[email protected]> Signed-off-by: Matt Topol <[email protected]>

Bonus: add
ArrayData::IsValid()to make it consistent withArrayandArraySpan.Rationale for this change
This is the proposed fix to #34361 plus the addition of more APIs dealing with validity/nullity.
What changes are included in this PR?
This PR changes the behavior of
IsNullandIsValidinArray,ArrayData, andArraySpan.It preserves the behavior of
MayHaveNulls,GetNullCountand introduces new member functions toArray,ArrayData, andArraySpan:bool HasValidityBitmap() constbool MayHaveLogicalNulls() constint64_t ComputeLogicalNullCount() constAre these changes tested?
Yes.
Are there any user-facing changes?
Yes. See above.
Breakage with these changes can only happen if users rely on
IsNull(i)always returningtruefor union types, but we have users reporting that the current behavior or broken #34315. This is why the behavior ofIsNullandIsValidis changing.,This PR contains a "Critical Fix".