ARROW-9430: [C++] Implement replace_with_mask kernel #10412

lidavidm · 2021-05-27T15:05:04Z

This implements a kernel equivalent to NumPy's arr[mask] = [values]. Given an array and an equal-length (or scalar) boolean mask, along with an array of replacement values passed via options, each array element whose corresponding mask value is true is replaced with the next value from the replacement array.
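To make the semantics described above concrete, here is a minimal reference sketch in plain C++. It is not the Arrow kernel: `ReplaceWithMaskSketch` is a hypothetical name, `std::nullopt` stands in for a null slot, and the handling of a null mask entry (null output, no replacement consumed) follows the wording of the doc strings discussed later in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <vector>

// Hypothetical reference implementation of the described semantics.
// std::nullopt models a null value; this is NOT the Arrow kernel itself.
template <typename T>
std::vector<std::optional<T>> ReplaceWithMaskSketch(
    const std::vector<std::optional<T>>& values,
    const std::vector<std::optional<bool>>& mask,
    const std::vector<std::optional<T>>& replacements) {
  assert(values.size() == mask.size());  // array mask must match the input length
  std::vector<std::optional<T>> out;
  out.reserve(values.size());
  std::size_t next = 0;  // index of the next unused replacement
  for (std::size_t i = 0; i < values.size(); ++i) {
    if (!mask[i].has_value()) {
      out.push_back(std::nullopt);  // null mask -> null output, nothing consumed
    } else if (*mask[i]) {
      if (next >= replacements.size()) {
        throw std::invalid_argument("not enough replacement values");
      }
      out.push_back(replacements[next++]);  // true -> take the next replacement
    } else {
      out.push_back(values[i]);  // false -> keep the original value
    }
  }
  return out;
}
```

With values [1, 2, 3, 4], mask [true, false, null, true] and replacements [10, 20], this sketch yields [10, 2, null, 20].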

lidavidm · 2021-05-27T15:05:34Z

Note once #10410 goes through, this should probably be consolidated into the same file (scalar_if_else.cc).

github-actions · 2021-05-27T15:48:13Z

https://issues.apache.org/jira/browse/ARROW-9430

lidavidm · 2021-05-27T16:07:12Z

(N.B. needs some more work - trying to add a test with random data)

jorisvandenbossche · 2021-06-02T08:12:39Z

High-level question: is there a reason that the replacement values are passed through an options struct, and not as a third argument? (Because that is an array of a different length? But e.g. for "take" the indices are not in an options struct.)

And if we can start bike-shedding about the name .. ;) For me, "override_mask" sounds like it would replace the validity mask of the array. So if we use "override" as the verb, I would at least make it "override_with_mask" or so. But "replace" or "set_values" might also be possible verbs (and with the different variations of this kernel, it could e.g. be "replace_with_mask", "replace_with_indices", "replace_with_mapping").

lidavidm · 2021-06-02T12:21:19Z

It's in a struct because it's a different length, yes. We could make it a vector kernel instead of a scalar kernel and then the replacements could be passed as another argument. I'm not sure which would be more useful though.

For the name, replace_with_mask etc. sound good. CC @nirandaperera so he's aware for ARROW-9431 as well.

nirandaperera · 2021-06-02T12:42:08Z

@lidavidm @jorisvandenbossche @bkietz I'm also thinking about how to handle arrays of different lengths for ARROW-9431. Like David said, the compute infrastructure guarantees that all arrays passed to the function are of the same length. If we go ahead with the Options approach for ARROW-9431, it would be a unary function with the other 2 arrays in the options.
So should we make ARROW-9430 and ARROW-9431 vector kernels instead?

nirandaperera · 2021-06-02T14:23:09Z

BTW we need to add docs to the PR I think 🙂

lidavidm · 2021-06-02T14:26:56Z

Yup, good catch, though maybe let's decide if this is to be a vector or scalar kernel (as I also need to add Python bindings)

lidavidm · 2021-06-02T22:12:38Z

This is now a vector kernel, with docs, that should support any fixed-size type. However I still need to add support for binary types which I expect people would want to use.

bkietz · 2021-06-03T18:47:42Z

FWIW: I'd guess this should be a vector kernel. As a thought experiment: I don't think it'd ever be correct to use this function from a query expression even in its scalar form since that'd require prior knowledge of the number of set bits in the mask

jorisvandenbossche · 2021-06-04T11:40:18Z

@bkietz I think it's correct that this will probably not be used in a typical query execution context. Its main purpose is to mimic "setitem" operations in e.g. pandas (arr[mask] = val).

jorisvandenbossche

Cool, I added a few small doc comments (didn't look at the C++ implementation in detail). The ReplaceWithOptions class still needs to be exposed in Python (you get an import warning at the moment about it)

jorisvandenbossche · 2021-06-04T11:51:17Z

cpp/src/arrow/compute/api_vector.h

Suggested change

/// to a true value in the mask with the next element from `options`.

/// to a true value in the mask with the next element from `replacements`.

jorisvandenbossche · 2021-06-04T11:55:45Z

cpp/src/arrow/compute/kernels/vector_replace.cc

Suggested change

"each corresponding element of the mask will be replaced by the next "

"each corresponding true value of the mask will be replaced by the next "

?

This made no sense on a re-read so I reworded the docs here.

jorisvandenbossche · 2021-06-04T11:56:51Z

cpp/src/arrow/compute/kernels/vector_replace.cc

I think popcnt will not be clear for many users. Maybe something like sum(mask == true) ?
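For readers who do not know the term, "popcnt" here is the number of set bits in the mask, which is the same quantity as sum(mask == true). A minimal sketch of that count over a packed bitmap, assuming least-significant-bit-first ordering as used by Arrow bitmaps (`CountTrueSketch` is a hypothetical helper, not an Arrow function):

```cpp
#include <cassert>
#include <cstdint>

// Counts the set bits in positions [offset, offset + length) of a packed,
// LSB-first bitmap. Hypothetical helper for illustration only; a real
// implementation would process whole words with a hardware popcount.
inline std::int64_t CountTrueSketch(const std::uint8_t* bitmap,
                                    std::int64_t offset, std::int64_t length) {
  assert(offset >= 0 && length >= 0);
  std::int64_t count = 0;
  for (std::int64_t i = offset; i < offset + length; ++i) {
    count += (bitmap[i / 8] >> (i % 8)) & 1;  // extract bit i and add it
  }
  return count;
}
```

For an array of replacements, this count is the number of replacement values the kernel will consume.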

cpp/src/arrow/compute/kernels/vector_replace_test.cc

jorisvandenbossche · 2021-06-04T12:04:12Z

docs/source/cpp/compute.rst

Are non-fixed-width and binary types now supported as well?

jorisvandenbossche · 2021-06-04T12:23:32Z

cpp/src/arrow/compute/kernels/vector_replace.cc

Suggested change

"value of the replacements (or null if the mask is null)."

"value of the replacements (or null if the mask is null). "

lidavidm · 2021-06-04T13:08:03Z

I removed ReplaceWithOptions when I made this a vector kernel, so I fixed up the docs - thanks for catching that.

cpp/src/arrow/compute/kernels/vector_replace.cc

lidavidm · 2021-06-14T20:12:16Z

@bkietz @jorisvandenbossche I know y'all are busy, but any other comments? Once this is in, @nirandaperera can get started on ARROW-9431 on top of this

jorisvandenbossche

I did another pass. Looking good to me (but I just checked the docs and code logic, for the C++ implementation details I will defer to someone more knowledgeable)

Should there maybe be some tests where the array, mask or replacements are sliced (i.e. have an offset)? Or is that already covered, or not something that typically goes wrong?

jorisvandenbossche · 2021-06-15T08:22:10Z

cpp/src/arrow/compute/kernels/vector_replace.cc

Suggested change

-    ARROW_ASSIGN_OR_RAISE(auto array,
-                          MakeArrayOfNull(array.type, array.length, ctx->memory_pool()));
-    *output = *array->data();
+    ARROW_ASSIGN_OR_RAISE(auto replacement_array,
+                          MakeArrayOfNull(array.type, array.length, ctx->memory_pool()));
+    *output = *replacement_array->data();

(to be consistent with below, and not confuse with the input array which uses the same variable name)

jorisvandenbossche · 2021-06-15T09:16:59Z

cpp/src/arrow/compute/kernels/vector_replace.cc

Maybe add a comment to mention this is setting the out_bitmap to all valid?

@lidavidm I think it will be better to use BitUtil::SetBitsTo/SetBitmap here. It would more precisely set values in the range [offset, offset+length).

We might over-allocate a bit and we should make sure any such bytes are initialized. Also I'd guess it's faster to just bit-blit a constant value over a buffer rather than try to set bits precisely. (SetBitsTo is quite a bit more complicated.)

I guess if we wanted to support can_write_into_slices then we'd need SetBitsTo here.

Yes, that would prevent stepping on other slice's result.
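The trade-off in this exchange can be sketched with two hypothetical helpers. Neither is Arrow's BitUtil code and the names are made up for illustration: one blits a constant over the whole buffer (simple, and it initializes any over-allocated bytes), the other touches only the bits in [offset, offset + length), which is what writing into a slice of a shared output would need.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Blit approach: mark every bit of the buffer as valid, padding included.
// Hypothetical helper, not an Arrow API.
inline void FillAllValidSketch(std::uint8_t* bitmap, std::size_t num_bytes) {
  std::memset(bitmap, 0xFF, num_bytes);
}

// Precise approach: set only bits [offset, offset + length) of an LSB-first
// bitmap and leave all other bits untouched. Hypothetical helper.
inline void SetBitRangeSketch(std::uint8_t* bitmap, std::int64_t offset,
                              std::int64_t length, bool value) {
  assert(offset >= 0 && length >= 0);
  for (std::int64_t i = offset; i < offset + length; ++i) {
    const std::uint8_t bit = static_cast<std::uint8_t>(1u << (i % 8));
    if (value) {
      bitmap[i / 8] |= bit;
    } else {
      bitmap[i / 8] &= static_cast<std::uint8_t>(~bit);
    }
  }
}
```

The per-bit loop keeps the sketch short; it only illustrates which bits each approach is allowed to modify, not how a tuned implementation would do it.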

jorisvandenbossche · 2021-06-15T09:25:09Z

cpp/src/arrow/compute/kernels/vector_replace.cc

Hasn't this already been done in the else if (valid_block.NoneSet()) { block above?

That branch only applies some of the time, but you are right in that there's no point doing it above since it's replicated here. (However the branch should be kept to skip those values.)

lidavidm · 2021-06-15T12:23:33Z

The ReplaceWithMaskRandom tests do test sliced arrays. (Unlike the scalar tests we can't test slices just by slicing the given example inputs, unfortunately.)

jorisvandenbossche · 2021-06-22T11:28:07Z

@bkietz do you want to take another look here?

lidavidm · 2021-06-25T17:36:38Z

Note there's some code here for handling fixed-width types that now duplicates what's in ARROW-13064/#10557. We should probably unify those at some point (after one or the other merges).

ursabot · 2021-07-08T14:44:56Z

Supported benchmark command examples:

@ursabot benchmark help

To run all benchmarks:
@ursabot please benchmark

To filter benchmarks by language:
@ursabot please benchmark lang=Python
@ursabot please benchmark lang=C++
@ursabot please benchmark lang=R
@ursabot please benchmark lang=Java

To filter Python and R benchmarks by name:
@ursabot please benchmark name=file-write
@ursabot please benchmark name=file-write lang=Python
@ursabot please benchmark name=file-.*

To filter C++ benchmarks by archery --suite-filter and --benchmark-filter:
@ursabot please benchmark command=cpp-micro --suite-filter=arrow-compute-vector-selection-benchmark --benchmark-filter=TakeStringRandomIndicesWithNulls/262144/2 --iterations=3

For other command=cpp-micro options, please see https://github.com/ursacomputing/benchmarks/blob/main/benchmarks/cpp_micro_benchmarks.py

lidavidm · 2021-07-08T14:46:12Z

@ursabot please benchmark lang=C++

ursabot · 2021-07-08T14:47:12Z

Benchmark runs are scheduled for baseline = 7eea2f5 and contender = f79438d96c42b7728c3f9860aadad545cc5ac483. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Provided benchmark filters do not have any benchmark groups to be executed on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2 (mimalloc)
[Skipped ⚠️ Only ['Python', 'R'] langs are supported on ursa-i9-9960x] ursa-i9-9960x (mimalloc)
[Finished ⬇️0.0% ⬆️0.0%] ursa-thinkcentre-m75q (mimalloc)
Supported benchmarks:
ursa-i9-9960x: langs = Python, R
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True

lidavidm · 2021-07-13T12:48:58Z

Final comments from anyone?

nirandaperera · 2021-07-13T15:26:14Z

cpp/src/arrow/compute/kernels/vector_replace.cc

Would this be called for a FixedWidthType execution? If so, this might conflict with the computed_preallocate option. 🤔

The same applies to the other memory allocations in the function.

You are right, this needs to be a bit smarter.

I also notice that it doesn't handle the case where len(replacements) > len(mask).

nirandaperera · 2021-07-13T15:28:27Z

cpp/src/arrow/compute/kernels/vector_replace.cc

Suggested change

const CopyBitmap copy_bitmap, const uint8_t* mask_bitmap,

const CopyBitmap &copy_bitmap, const uint8_t* mask_bitmap,

In this case I intentionally made CopyBitmap itself cheaper to copy than to use as a reference - it's <= 2 words though I suppose the compiler will optimize it identically either way.

Except, doing this does seem about 5% slower:

Before:
Benchmark                                        Time      CPU       Iterations  UserCounters...
ReplaceWithMaskLowSelectivityBench/16384/0       33631 ns  33631 ns  204971      bytes_per_second=3.62975G/s
ReplaceWithMaskLowSelectivityBench/16384/99      35018 ns  35017 ns  202363      bytes_per_second=3.46498G/s
ReplaceWithMaskHighSelectivityBench/16384/0      77268 ns  77267 ns  90912       bytes_per_second=1.57985G/s
ReplaceWithMaskHighSelectivityBench/16384/99     75751 ns  75750 ns  92444       bytes_per_second=1.60176G/s

After:
Benchmark                                        Time      CPU       Iterations  UserCounters...
ReplaceWithMaskLowSelectivityBench/16384/0       35512 ns  35511 ns  192582      bytes_per_second=3.43751G/s
ReplaceWithMaskLowSelectivityBench/16384/99      36702 ns  36701 ns  191996      bytes_per_second=3.30598G/s
ReplaceWithMaskHighSelectivityBench/16384/0      82957 ns  82956 ns  85194       bytes_per_second=1.47151G/s
ReplaceWithMaskHighSelectivityBench/16384/99     80415 ns  80413 ns  86354       bytes_per_second=1.50887G/s

interesting 😄
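For context, the two signatures being compared look roughly like the sketch below. `BitmapViewSketch` and its fields are assumptions made for illustration; the thread only states that CopyBitmap is at most two words, not what it contains. Both functions compute the same result, and whether one is faster is exactly what the benchmark above was measuring.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in for a small helper: a pointer plus an offset.
struct BitmapViewSketch {
  const std::uint8_t* data;
  std::int64_t offset;
  bool GetBit(std::int64_t i) const {
    const std::int64_t pos = offset + i;
    return (data[pos / 8] >> (pos % 8)) & 1;  // LSB-first bit lookup
  }
};
static_assert(std::is_trivially_copyable<BitmapViewSketch>::value,
              "small and trivially copyable, so passing by value is cheap");

// Variant 1: the helper is passed by value (copied into the callee).
inline std::int64_t CountByValue(const BitmapViewSketch view, std::int64_t length) {
  assert(length >= 0);
  std::int64_t count = 0;
  for (std::int64_t i = 0; i < length; ++i) count += view.GetBit(i) ? 1 : 0;
  return count;
}

// Variant 2: the helper is passed by const reference, as in the suggestion.
inline std::int64_t CountByRef(const BitmapViewSketch& view, std::int64_t length) {
  assert(length >= 0);
  std::int64_t count = 0;
  for (std::int64_t i = 0; i < length; ++i) count += view.GetBit(i) ? 1 : 0;
  return count;
}
```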

nirandaperera · 2021-07-13T15:34:27Z

cpp/src/arrow/compute/kernels/vector_replace.cc

If we can create a BooleanArray from mask, then we can call true_count straight away! It does this internally IINM.

nirandaperera · 2021-07-13T15:51:58Z

cpp/src/arrow/compute/kernels/vector_replace.cc

I have a feeling that we should leave all kernels as follows,

kernel.null_handling = NullHandling::type::COMPUTED_NO_PREALLOCATE;
kernel.mem_allocation = MemAllocation::type::NO_PREALLOCATE;

and later, change these flags. One thing I realized was that NullHandling::COMPUTED_PREALLOCATE and MemAllocation::PREALLOCATE introduce a lot of niche cases. It was helpful for me to test those cases using the compute::CheckScalar test util. It checks for slicing, chunks, etc. for scalar functions.

arrow/cpp/src/arrow/compute/kernels/test_util.cc

Line 106 in e990d17

void CheckScalar(std::string func_name, const DatumVector& inputs,

But now that we are on vector functions, the semantics might change :-)

@bkietz WDYT?

It is unfortunate that we don't have a versatile utility for vector functions like CheckScalar. One way to verify correct writing into slices would be: run the function once to ensure output is correctly allocated/shaped/etc, then invoke the kernel directly into a slice of that output. If everything is working as it should, the kernel should simply overwrite that slice with new values, leaving values outside the slice untouched.

Actually: I'm not sure if it even makes sense for this kernel to write into a slice, since it needs the entirety of all its arguments to execute. So if we manually force it to write into a slice of the output, it'd write different results.
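The slice-writing check described above could look roughly like the following sketch. It simplifies the idea by pre-filling the output with a sentinel instead of running the function once first, and it uses a hypothetical `SliceKernel` callback over plain `std::vector<int>` rather than Arrow's kernel interface. As noted in the comment above, a check like this only makes sense for kernels that can work on a slice independently.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical kernel shape: reads `input` and writes results for positions
// [offset, offset + length) into `*output`.
using SliceKernel = std::function<void(const std::vector<int>& input,
                                       std::vector<int>* output,
                                       std::size_t offset, std::size_t length)>;

// Returns true if the kernel left every output slot outside the target slice
// untouched. Assumes the kernel never legitimately produces `sentinel`.
inline bool WritesOnlyIntoSlice(const SliceKernel& kernel,
                                const std::vector<int>& input,
                                std::size_t offset, std::size_t length,
                                int sentinel) {
  assert(offset + length <= input.size());
  std::vector<int> output(input.size(), sentinel);  // pre-filled output
  kernel(input, &output, offset, length);
  for (std::size_t i = 0; i < output.size(); ++i) {
    const bool inside = i >= offset && i < offset + length;
    if (!inside && output[i] != sentinel) return false;  // stepped outside
  }
  return true;
}
```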

lidavidm · 2021-07-14T15:22:01Z

I think I've addressed all the feedback here.

nirandaperera · 2021-07-14T15:30:25Z

I think I've addressed all the feedback here.

I'm +1 for this!

lidavidm · 2021-07-14T15:42:33Z

Merged, thanks. This should unblock ARROW-9431 if you do still plan to look at it.

jorisvandenbossche · 2021-07-15T10:07:01Z

Nice to see this one merged, thanks all!

github-actions bot added the Component: C++ label May 27, 2021

lidavidm marked this pull request as draft May 27, 2021 16:07

lidavidm marked this pull request as ready for review May 28, 2021 13:27

lidavidm force-pushed the arrow-9430 branch from af22846 to 7c78aae Compare June 2, 2021 13:57

lidavidm changed the title ~~ARROW-9430: [C++] Implement override_mask kernel~~ ARROW-9430: [C++] Implement replace_with_mask kernel Jun 2, 2021

lidavidm marked this pull request as draft June 2, 2021 15:51

lidavidm marked this pull request as ready for review June 3, 2021 17:24

jorisvandenbossche reviewed Jun 4, 2021

bkietz reviewed Jun 7, 2021

lidavidm force-pushed the arrow-9430 branch 2 times, most recently from 16779ec to cf41f39 Compare June 14, 2021 12:32

jorisvandenbossche reviewed Jun 15, 2021

lidavidm force-pushed the arrow-9430 branch from 742ba11 to 0a33834 Compare June 21, 2021 12:34

lidavidm force-pushed the arrow-9430 branch from 0a33834 to 57b04ee Compare June 24, 2021 17:20

lidavidm force-pushed the arrow-9430 branch from 629e388 to f79438d Compare July 8, 2021 14:44

lidavidm force-pushed the arrow-9430 branch from f79438d to d93d72c Compare July 12, 2021 20:29

nirandaperera reviewed Jul 13, 2021

lidavidm force-pushed the arrow-9430 branch from d93d72c to 49dd437 Compare July 13, 2021 18:49

lidavidm added 15 commits July 13, 2021 16:07

ARROW-9430: [C++] Implement override_mask kernel

b249eda

ARROW-9430: [C++] Clarify replace_with_mask implementation

ccca017

ARROW-9430: [C++] Cross-reference if_else and replace_with_mask

ea1517a

ARROW-9430: [C++] Clean up tests slightly

853eee3

ARROW-9430: [C++] Move test helper into test_util.h

e154930

ARROW-9430: [C++] Take advantage of preallocation

97fabbe

ARROW-9430: [C++] Clean up impl

68b733b

ARROW-9430: [C++] Add simple benchmark

ba75c90

ARROW-9430: [C++] Count replacements up front

4ba9837

ARROW-9430: [C++] Fix min/max in benchmark

5b2de61

ARROW-9430: [C++] Improve performance

0fc1995

ARROW-9430: [C++] Actually run format

17e3cc3

ARROW-9430: [C++] Preallocate validity buffer too

3a56c31

ARROW-9430: [C++] Fix replacement array > input array

b5b656a

ARROW-9430: [C++] Properly use preallocation in scalar mask case

6fa62de

lidavidm force-pushed the arrow-9430 branch from 49dd437 to 6fa62de Compare July 13, 2021 20:08

lidavidm closed this in 6db88a9 Jul 14, 2021

asfimport mentioned this pull request Nov 22, 2021

[C++/Python] Kernel for SetItem(BooleanArray, values) #25504

Closed
