jvanstraten (Contributor) commented Jul 4, 2022

The min/max aggregate compute kernels seemed to discard their state between partitions, so they would only aggregate the last partition they see (in each thread).

This is the simplest change I could come up with to fix this, but honestly I'm not sure why the local variable even exists. It seems to me it could just be replaced with this->state directly, since there doesn't seem to be any failure path where this->state isn't updated from local. Am I missing something?

ETA: I tried to make a test case for this, only to find that there is already a test case for this. In that case however, it seems that the merging of the partition results is done by Merge'ing the result of separate Consume calls, rather than chaining multiple Consume calls. I'm not sure how to trigger the latter behavior from a normal C++ test case.

lidavidm (Member) left a comment

Hmm, don't think you're missing anything.

It seems you should be able to hit this by sending multiple batches through the plan, but I would have thought that's already tested.

jvanstraten (Contributor, Author) commented
Chunked arrays are tested here, but the call pattern there (cleaned up from my debug prints for chunked_input1) is:

auto min_max_impl_1 = MinMaxImpl(...)
auto min_max_impl_2 = MinMaxImpl(...)
min_max_impl_2.Consume([5,1,2,3,4]) -> (1, 5)
min_max_impl_1.MergeFrom(min_max_impl_2) -> (1, 5)
auto min_max_impl_3 = MinMaxImpl(...)
min_max_impl_3.Consume([9,1,null,3,4]) -> (1, 9)
min_max_impl_1.MergeFrom(min_max_impl_3) -> (1, 9)
min_max_impl_1.Finalize() -> (1, 9)

for which it does not matter whether Consume() overwrites the previous state. The test cases aren't great anyway, since the last chunk holds the min and max values in each of them, but even if I swapped one of them around you'd still get:

auto min_max_impl_1 = MinMaxImpl(...)
auto min_max_impl_2 = MinMaxImpl(...)
min_max_impl_2.Consume([9,1,null,3,4]) -> (1, 9)
min_max_impl_1.MergeFrom(min_max_impl_2) -> (1, 9)
auto min_max_impl_3 = MinMaxImpl(...)
min_max_impl_3.Consume([5,1,2,3,4]) -> (1, 5)
min_max_impl_1.MergeFrom(min_max_impl_3) -> (1, 9)
min_max_impl_1.Finalize() -> (1, 9)

I don't know why it's doing it this way. Seems rather inefficient to me. Evidently though, for larger-scale workloads it does call Consume() more than once before merging and throwing away each MinMaxImpl instance.

lidavidm (Member) commented Jul 5, 2022

My bad, I was looking at the hash aggregate tests!

I suppose the 'right' way to test it is to construct an ExecPlan and feed the data through. There are some tests in plan_test.cc but it doesn't have much coverage of the kernels themselves. We may need some parameterization/a helper to test "both ways" of calling aggregates in much the same way hash_aggregate_test.cc does.

lidavidm (Member) commented Jul 5, 2022

The behavior you're seeing stems from this:

class ScalarAggExecutor : public KernelExecutorImpl<ScalarAggregateKernel> {
 public:
  Status Init(KernelContext* ctx, KernelInitArgs args) override {
    input_descrs_ = &args.inputs;
    options_ = args.options;
    return KernelExecutorImpl<ScalarAggregateKernel>::Init(ctx, args);
  }

  Status Execute(const ExecBatch& args, ExecListener* listener) override {
    return ExecuteImpl(args.values, listener);
  }

  Status ExecuteImpl(const std::vector<Datum>& args, ExecListener* listener) {
    ARROW_ASSIGN_OR_RAISE(
        batch_iterator_, ExecBatchIterator::Make(args, exec_context()->exec_chunksize()));
    ExecBatch batch;
    while (batch_iterator_->Next(&batch)) {
      // TODO: implement parallelism
      if (batch.length > 0) {
        RETURN_NOT_OK(Consume(batch));
      }
    }
    Datum out;
    RETURN_NOT_OK(kernel_->finalize(kernel_ctx_, &out));
    RETURN_NOT_OK(listener->OnResult(std::move(out)));
    return Status::OK();
  }

  Datum WrapResults(const std::vector<Datum>&,
                    const std::vector<Datum>& outputs) override {
    DCHECK_EQ(1, outputs.size());
    return outputs[0];
  }

 private:
  Status Consume(const ExecBatch& batch) {
    // FIXME(ARROW-11840) don't merge *any* aggegates for every batch
    ARROW_ASSIGN_OR_RAISE(
        auto batch_state,
        kernel_->init(kernel_ctx_, {kernel_, *input_descrs_, options_}));
    if (batch_state == nullptr) {
      return Status::Invalid("ScalarAggregation requires non-null kernel state");
    }
    KernelContext batch_ctx(exec_context());
    batch_ctx.SetState(batch_state.get());
    RETURN_NOT_OK(kernel_->consume(&batch_ctx, batch));
    RETURN_NOT_OK(kernel_->merge(kernel_ctx_, std::move(*batch_state), state()));
    return Status::OK();
  }
};

We could/should fix that up too (perhaps in a separate JIRA)

lidavidm (Member) commented Jul 5, 2022

I suppose it's done that way because of the "// TODO: implement parallelism", but I wonder whether we want to just say "use Acero for that" or actually go implement parallelism.

westonpace (Member) commented

I suppose the 'right' way to test it is to construct an ExecPlan and feed the data through. There are some tests in plan_test.cc but it doesn't have much coverage of the kernels themselves. We may need some parameterization/a helper to test "both ways" of calling aggregates in much the same way hash_aggregate_test.cc does.

@drin is working on this for min/max.

I think there is probably more interest/priority in making the exec plan case work well vs the chunked array case. If you are doing compute on chunked arrays and the entire chunked array fits in memory, then it is probably sufficient to just concatenate the chunked array into a single array at the beginning of your compute work.

@jvanstraten jvanstraten changed the title ARROW-16700: [C++][R][Datasets] aggregates on partitioning columns ARROW-16904: [C++] min/max not deterministic if Parquet files have multiple row groups Jul 5, 2022
github-actions bot commented Jul 5, 2022

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

jvanstraten (Contributor, Author) commented

I've relinked this to ARROW-16904 since it's more indicative of what's being fixed and so ARROW-16700 can be used for the issue related to guarantee expressions, but feel free to overrule this PR with @drin's version once they've worked out test coverage.

drin (Contributor) commented Jul 5, 2022

Sorry, I just caught up to the various threads leading here.

I can just add onto this PR, if you don't mind @jvanstraten .

jvanstraten (Contributor, Author) commented

Also fine with me. I assume you mean to PR into my branch so it ends up in here?

drin (Contributor) commented Jul 5, 2022

yep!

drin (Contributor) commented Jul 7, 2022

opened a draft PR to add to this one:

https://github.com/jvanstraten/arrow/pull/5

It looks a bit messy because of rebases, not sure how to easily improve that

lidavidm (Member) commented Jul 8, 2022

Something is very off with the merge here. Can we do something like cherry-pick the commits onto a fresh branch and then force-push?

jvanstraten and others added 7 commits July 8, 2022 17:15
Based on multiple locations fixed in 85789a9, it seemed better to simplify a few conditions so that updating `this->state` from `local` is consolidated and more maintainable. This should make it easier to understand that the only difference is in how the local state is updated when consuming input data; there is no difference in how the aggregate's state is updated from the local state.

This R test captures the code from ARROW-16904 that reproduces this bug. It uses a scalar aggregate, min, on a dataset (1 column) that produces several exec batches.

This test exercises the bug found in ARROW-16904 by creating a ScalarAggregateNode for the "min_max" function. Previously, there was no unit test for scalar aggregate nodes.

This minor improvement splits the min and max values into different chunks of chunked_input1. This doesn't improve coverage given how the scalar aggregate executes on a chunked array, but it seemed a nice extra thing to include.

Added brief documentation for AggregateNodeOptions that provides insight into how the `keys` attribute affects how the `aggregates` attribute is used (or rather, how inputs are delegated to those aggregates).
@jvanstraten jvanstraten force-pushed the ARROW-16700-aggregates-on-partitioning-columns branch from 9dffe78 to 695e29b on July 8, 2022 15:16
jvanstraten (Contributor, Author) commented

Should be fixed. I guess a merge commit was lost somewhere along the line and git(hub) got confused.

Comment on lines +113 to +115
/// If the keys attribute is a non-empty vector, then each aggregate in `aggregates` is
/// expected to be a HashAggregate function. If the keys attribute is an empty vector,
/// then each aggregate is assumed to be a ScalarAggregate function.
Member:

✔️ thank you!


auto input = MakeGroupableBatches(/*multiplicity=*/parallel ? 100 : 1);
auto minmax_opts = std::make_shared<ScalarAggregateOptions>();
auto expected_value = StructScalar::Make(
Member:

nit, but doesn't ScalarFromJSON handle StructScalar directly?

Contributor:

I tried looking at the function and searching for usages but I couldn't figure it out. If you know how to do it, I can update it. I wasn't sure if a StructScalar is essentially an object (e.g. { -8, 12} would be a 2 field struct)

Member:

// Append a JSON value that is either an array of N elements in order
// or an object mapping struct names to values (omitted struct members
// are mapped to null).

should be {"min": -8, "max": 12}

lidavidm (Member) commented Jul 8, 2022

Note the R lints https://github.com/apache/arrow/runs/7253994871?check_suite_focus=true


> lintr::lint_package('/arrow/r')
Warning: file=tests/testthat/test-dataset.R,line=622,col=30,[infix_spaces_linter] Put spaces around all infix operators.
INFO:archery:Running Docker linter
Warning: file=tests/testthat/test-dataset.R,line=624,col=27,[function_left_parentheses_linter] Remove spaces before the left parenthesis in a function call.
Warning: file=tests/testthat/test-dataset.R,line=629,col=8,[pipe_continuation_linter] `%>%` should always have a space before it and a new line after it, unless the full pipeline fits on one line.
Warning: file=tests/testthat/test-dataset.R,line=632,col=26,[single_quotes_linter] Only use double-quotes.
Warning: file=tests/testthat/test-dataset.R,line=632,col=51,[infix_spaces_linter] Put spaces around all infix operators.
Warning: file=tests/testthat/test-dataset.R,line=632,col=52,[single_quotes_linter] Only use double-quotes.
Warning: file=tests/testthat/test-dataset.R,line=635,col=27,[function_left_parentheses_linter] Remove spaces before the left parenthesis in a function call.
Warning: file=tests/testthat/test-dataset.R,line=640,col=8,[pipe_continuation_linter] `%>%` should always have a space before it and a new line after it, unless the full pipeline fits on one line.

drin (Contributor) commented Jul 8, 2022

Note the R lints https://github.com/apache/arrow/runs/7253994871?check_suite_focus=true

> lintr::lint_package('/arrow/r')
Warning: file=tests/testthat/test-dataset.R,line=622,col=30,[infix_spaces_linter] Put spaces around all infix operators.
...
Warning: file=tests/testthat/test-dataset.R,line=640,col=8,[pipe_continuation_linter] `%>%` should always have a space before it and a new line after it, unless the full pipeline fits on one line.

Thanks. I tried to match style because I had trouble running the linter. I'll fix these and try to get it running.

drin added 2 commits July 8, 2022 12:17
Used `ScalarFromJSON` to construct the StructScalar instead of
`StructScalar::Make`.

Also fixed style using clang-format
drin and others added 2 commits July 8, 2022 12:50
simplifying test body by specifying `chunk_size` to `write_parquet` and using `replicate` instead of `sapply`.

Adding other style changes for readability.

Co-authored-by: Neal Richardson <[email protected]>
@lidavidm lidavidm merged commit 66c66d0 into apache:master Jul 11, 2022
@drin drin deleted the ARROW-16700-aggregates-on-partitioning-columns branch July 11, 2022 18:10
westonpace added a commit that referenced this pull request Jul 22, 2022
…13518)

This updates the Scanner node such that it will use the guarantee expression to fill out columns missing from the dataset but guaranteed to be some constant with appropriate scalars, rather than just inserting a null placeholder column. In case both are available, the dataset constructor prefers using the scalar from the guarantee expression over the actual data, since the latter would probably be an array that unnecessarily repeats the constant value.

This is the other part of what was uncovered while analyzing ARROW-16700, the more direct cause being a duplicate of ARROW-16904 (see also #13509 for my fix for that).

Lead-authored-by: Jeroen van Straten <[email protected]>
Co-authored-by: Aldrin M <[email protected]>
Co-authored-by: octalene <[email protected]>
Co-authored-by: Aldrin Montana <[email protected]>
Co-authored-by: Weston Pace <[email protected]>
Signed-off-by: Weston Pace <[email protected]>