-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-13064: [C++] Implement select ('case when') function for fixed-width types #10557
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
3e7e6cb to
a00643e
Compare
|
Note there's some code here for fixed-width types that duplicates what's in ARROW-9430/#10412. They should probably get unified. |
|
To start my usual round of name bike-shedding ;), I would propose to not call this "select", to avoid the confusion with SQL's |
|
|
|
Ok for |
|
Renamed to |
|
It looks like this needs rebasing. |
|
Thanks for the heads up, rebased. |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice, thank you. Do you want to add a couple benchmarks for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit, but would be nice to normalize includes here. I think we normally use #include "arrow/..." for intra-Arrow inclusions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead use ValidateOutput from kernels/test_util.h. It will also check that no data is left uninitialized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this should be moved to kernels/test_util.{h,cc} instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also worth noting: this seems to be very similar in function to CheckWithDifferentShapes. I think it'd be worthwhile to unify these two and promote it to test_util.h so it can be reused by more varargs scalar kernels
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's also somewhat overlapping with CheckScalar now that I look at it. I'll take a look and see if I can't consolidate all three. (I may do so in a different PR until we decide what to do with the implementation here, if we want to split it up into separate 'case' and 'when' functions or not as suggested.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also {scalar_false, values2, scalar_true, values1} -> values1?
cpp/src/arrow/compute/api_scalar.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So a condition being null is the same as being false, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, null is the same as false.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the definition of ReplaceTypes, this check doesn't seem necessary?
cc @bkietz for a second opinion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to me we could restructure "case_when" and yield a less daunting interface:
case(
when(cond_0, value_0),
when(cond_1, value_1),
value_else
)Where when masks slots with null wherever its condition is not true and case is only a variadic coalescing function (takes the first non null).
We'd be allocating a new null bitmap on each call to when which is not ideal. However typing is far clearer (when(cond: Boolean, value: T): T, case(...values: T): T) and case/when can be independently unit tested.
@pitrou what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: variadic coalesce is independently useful https://issues.apache.org/jira/browse/ARROW-13136
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another way to transpose this would be:
case(
first_true_in(cond_0, cond1),
value0,
value1,
value_else
)Where first_true_in returns the index of the first true and case's first argument is an index into the rest of its arguments:
assert first_true_in([True, False, False], [True, True, False]) == [0, 1, None]
assert case([0, 1, None], 'a b c'.split(), 'd e f'.split(), 'x y z'.split()) == 'a e z'.split()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the second example, case would essentially be 'choose' (ARROW-13220). I'm curious which would be faster; the coalesce PR is implemented columnwise (for fixed-width types) and tries to do block copies, while the 'choose' PR I'm looking at is necessarily rowwise. Probably depends on the distribution of the inputs.
Might the masking 'when' kernel and 'first_true_in' be useful kernels to have anyways?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that
case(
when(cond_0, value_0),
when(cond_1, value_1),
value_else
)
is not necessarily identical to case_when(cond_0, value_0, cond_1, value_1, value_else) as done in this PR, when it comes to nulls in the values?
For example, if you have a null in value_0, it should be chosen if the corresponding bit of cond_0 is True. However, if you implement that with a masking when and then coalesce (taking the first non null), it would not pick the null value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, you're right…
We can still try the 'choose' based version though I would expect that to be slower since it works rowwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A quick check with choose shows that it's only about 2/3 as fast:
CaseWhenBench32/1048576/0 48183534 ns 48183323 ns 13 bytes_per_second=334.659M/s
CaseWhenBench64/1048576/0 48610718 ns 48610432 ns 13 bytes_per_second=660.866M/s
CaseWhenBench32/1048576/99 48226819 ns 48226677 ns 15 bytes_per_second=334.327M/s
CaseWhenBench64/1048576/99 48891957 ns 48892172 ns 13 bytes_per_second=656.996M/s
though note we have to call both first_true_in, then fill_null, to get the 'else' behavior which doesn't help.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Somewhere, you should also test what happens when case_when is called with zero arguments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of uint8_t* out_values, you may want this to take a ArrayData* out, since you'll need it for non-fixed-width types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For non-fixed-width types, it might be simpler to handle it with a builder and AppendScalar?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems a bit excessive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be faster as written, oddly.
Before:
-----------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-----------------------------------------------------------------------------------------------
CaseWhenBench32/1048576/0 31933112 ns 31932368 ns 22 bytes_per_second=504.974M/s
CaseWhenBench64/1048576/0 33170481 ns 33168735 ns 21 bytes_per_second=968.533M/s
CaseWhenBench32/1048576/99 32487300 ns 32487411 ns 21 bytes_per_second=496.299M/s
CaseWhenBench64/1048576/99 33682029 ns 33680901 ns 21 bytes_per_second=953.715M/s
CaseWhenBench32Contiguous/1048576/0 7255445 ns 7255387 ns 96 bytes_per_second=1.632G/s
CaseWhenBench64Contiguous/1048576/0 7932437 ns 7932171 ns 88 bytes_per_second=2.97013G/s
CaseWhenBench32Contiguous/1048576/99 7526742 ns 7526709 ns 92 bytes_per_second=1.57303G/s
CaseWhenBench64Contiguous/1048576/99 8172498 ns 8172239 ns 83 bytes_per_second=2.88261G/s
After:
-----------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-----------------------------------------------------------------------------------------------
CaseWhenBench32/1048576/0 44166172 ns 44165634 ns 16 bytes_per_second=365.103M/s
CaseWhenBench64/1048576/0 44605356 ns 44603995 ns 16 bytes_per_second=720.227M/s
CaseWhenBench32/1048576/99 44867670 ns 44867051 ns 16 bytes_per_second=359.361M/s
CaseWhenBench64/1048576/99 45077818 ns 45076721 ns 15 bytes_per_second=712.607M/s
CaseWhenBench32Contiguous/1048576/0 17757494 ns 17757271 ns 39 bytes_per_second=682.819M/s
CaseWhenBench64Contiguous/1048576/0 18236327 ns 18235892 ns 38 bytes_per_second=1.29193G/s
CaseWhenBench32Contiguous/1048576/99 15051281 ns 15051008 ns 39 bytes_per_second=805.518M/s
CaseWhenBench64Contiguous/1048576/99 15504081 ns 15503998 ns 45 bytes_per_second=1.51944G/s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use GetNullCount() I think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem significantly different from CheckDispatchBest
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree; there's too much indirection here to see what's actually being tested. Please inline this some
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also worth noting: this seems to be very similar in function to CheckWithDifferentShapes. I think it'd be worthwhile to unify these two and promote it to test_util.h so it can be reused by more varargs scalar kernels
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::copy allows the ranges to overlap (unlike memcpy), so for simple pointers like this it gets inlined to memmove. Memmove can be slower (or faster) than memcpy; depends on your libc. It shouldn't differ by a very wide margin though so I don't think it's necessary to avoid std::copy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to me we could restructure "case_when" and yield a less daunting interface:
case(
when(cond_0, value_0),
when(cond_1, value_1),
value_else
)Where when masks slots with null wherever its condition is not true and case is only a variadic coalescing function (takes the first non null).
We'd be allocating a new null bitmap on each call to when which is not ideal. However typing is far clearer (when(cond: Boolean, value: T): T, case(...values: T): T) and case/when can be independently unit tested.
@pitrou what do you think?
|
I've changed this so the signature is |
|
What's the advantage of using a struct of boolean arrays as first argument instead of a vector? (eg for a struct you need to give it (dummy) names) |
|
@bkietz and I discussed this. Mostly it simplifies the function signature a lot and gets rid of the custom typechecking code in DispatchBest. It does require materializing scalar conditions and would need callers to do some pre-processing if they wish to avoid that (e.g. dropping any values associated with a scalar false/null condition, using values associated with a scalar true condition as the 'else' clause). It's mostly a point of discussion. I realized I forgot to benchmark so let me do that now. |
|
This is quite faster though (actually almost suspiciously so; also, the benchmark loop doesn't include constructing a structarray from individual boolean arrays). Struct: Alternating arguments: (EDIT: updated since I forgot to rebuild first) |
|
Any other thoughts here? If we're ok with CaseWhen(struct(bool...), T...) over CaseWhen(bool, T, bool, T, ...) I can make this kernel also support writing into slices. |
|
I'm fine with |
|
Filed ARROW-13325. |
|
Updated to support can_write_into_slices. Also, updated the benchmark bytes_per_second scaling to reflect the size of the output and not the size of all the inputs so the numbers are more reasonable. |
bkietz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor comments, otherwise this looks good to go
docs/source/cpp/compute.rst
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This (and some of the other comments) need to be updated with the new signature
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| auto null_scalar = MakeNullScalar(boolean()); | |
| ASSERT_OK_AND_ASSIGN(auto nulls, | |
| MakeArrayFromScalar(*null_scalar, len - 2 * (len / 3))); | |
| ASSERT_OK_AND_ASSIGN(auto nulls, MakeArrayOfNull(boolean(), len - 2 * (len / 3))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| auto cond1 = std::static_pointer_cast<BooleanArray>( | |
| rand.ArrayOf(boolean(), len, /*null_probability=*/0.01)); | |
| auto cond2 = std::static_pointer_cast<BooleanArray>( | |
| rand.ArrayOf(boolean(), len, /*null_probability=*/0.01)); | |
| auto cond3 = std::static_pointer_cast<BooleanArray>( | |
| rand.ArrayOf(boolean(), len, /*null_probability=*/0.01)); | |
| auto fld = field("cond", boolean(), key_value_metadata({{"null_probability", "1.0"}})); | |
| auto cond = rand.ArrayOf(field("", struct_({fld, fld, fld}), len); |
?
|
Fixed docstrings and fixed a TODO I noticed about fully initializing the output buffer. Also added a benchmark case for when both the cond struct array and the child cond boolean arrays can have nulls. This case is especially terrible, I made a slight optimization to eliminate one of the more egregious offenders I saw in perf, but it's still bad even then: |
|
@lidavidm what would you think about just raising an error for top level nulls? It doesn't seem like a useful case to me |
|
Should be fine. It would certainly trim down the inner loop a lot. |
|
Removed support for toplevel nulls. |
As mentioned on the meeting, I don't have a strong opinion on the exact signature (assuming you can create the inputs easily in the bindings either way), and if it simplifies the implementation / signature quite a bit, that sounds as a good reason. I quickly tried it out in Python: What I find a little bit annoying is that I have to provide "dummy" names (that are never used) to my boolean conditions in |
Co-authored-by: Joris Van den Bossche <[email protected]>
bkietz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
@jorisvandenbossche I'll make a PR for ARROW-11206 shortly and allow make_struct to be called with no names provided (defaulting to empty names or str(field_index) or so). That should enable you to write
pc.case_when(
pc.make_struct([True, False, None],
[False, True, None]),
[1, 2, 3],
[11, 12, 13])| ASSERT_FALSE(sig.MatchesInputs(args)); | ||
| } | ||
| { | ||
| KernelSignature sig({int8(), utf8()}, utf8(), /*is_varargs=*/true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working this out
|
@bkietz sounds good! |
This doesn't support variable-width types (e.g. strings) as the implementation here is columnwise. I will work on those separately (they require a rowwise implementation).
Also fixes a small bug in the CommonNumericType implementation (I noticed uint8 was getting promoted to int8).