ARROW-16904: [C++] min/max not deterministic if Parquet files have multiple row groups #13509
Conversation
lidavidm
left a comment
Hmm, don't think you're missing anything.
It seems you should be able to hit this by sending multiple batches through the plan, but I would have thought that's already tested.
Chunked arrays are tested here, but (as cleaned up from my debug prints) I don't know why it's doing it this way; it seems rather inefficient to me. Evidently, though, for larger-scale workloads it does get exercised.
My bad, I was looking at the hash aggregate tests! I suppose the 'right' way to test it is to construct an ExecPlan and feed the data through. There are some tests in plan_test.cc, but it doesn't have much coverage of the kernels themselves. We may need some parameterization/a helper to test "both ways" of calling aggregates, in much the same way hash_aggregate_test.cc does.
The behavior you're seeing stems from this: `arrow/cpp/src/arrow/compute/exec.cc`, lines 1124 to 1177 in 3d6240c.
We could/should fix that up too (perhaps in a separate JIRA).
I suppose it's done that way because of the …
@drin is working on this for min/max. I think there is probably more interest/priority in making the exec plan case work well vs. the chunked array case. If you are doing compute on chunked arrays and the entire chunked array fits in memory, then it is probably sufficient to just concatenate the chunked array into a single array at the beginning of your compute work.
I've relinked this to ARROW-16904 since it's more indicative of what's being fixed, and so ARROW-16700 can be used for the issue related to guarantee expressions; but feel free to overrule this PR with @drin's version once they've worked out test coverage.
Sorry, I just caught up to the various threads leading here. I can just add onto this PR, if you don't mind @jvanstraten.
Also fine with me. I assume you mean to PR into my branch so it ends up in here?
Yep!
Opened a draft PR to add to this one. It looks a bit messy because of rebases; not sure how to easily improve that.
Something is very off with the merge here. Can we do something like cherry-pick the commits onto a fresh branch and then force-push?
Based on multiple locations fixed in 85789a9, it seemed better to simplify a few conditions so that updating `this->state` from `local` is consolidated and more maintainable. This should make it easier to see that the only difference between code paths is in how the local state is updated when consuming input data; there is no difference in how the aggregate's state is updated from the local state.
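A minimal toy sketch of the consolidated pattern described above (the names `MinMaxState`, `MinMaxKernel`, and the `int`-based state are illustrative, not Arrow's actual classes): every `Consume` path updates a thread-local copy seeded from the previous state, and the shared state is written back from the local state in exactly one place.

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// Toy min/max accumulator state.
struct MinMaxState {
  int min = std::numeric_limits<int>::max();
  int max = std::numeric_limits<int>::min();
};

struct MinMaxKernel {
  MinMaxState state;  // plays the role of `this->state` in the PR discussion

  void Consume(const std::vector<int>& batch) {
    MinMaxState local = state;  // seed from previous state instead of starting fresh
    for (int v : batch) {
      local.min = std::min(local.min, v);
      local.max = std::max(local.max, v);
    }
    state = local;  // the single, consolidated update of `this->state` from `local`
  }
};
```

With this shape, chained `Consume` calls accumulate across batches: consuming `{5, 3}` and then `{9, 1}` leaves min 1 and max 9 rather than only the last batch's extrema.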
This R test captures the code from ARROW-16904 that reproduces this bug. It uses a scalar aggregate, `min`, on a one-column dataset that produces several exec batches.
This test exercises the bug found in ARROW-16904, by creating a ScalarAggregateNode for the "min_max" function. Previously, there was no unit test for scalar aggregate nodes.
This minor improvement splits the min and max values into different chunks of `chunked_input1`. It doesn't improve coverage, given how the scalar aggregate executes on a chunked array, but it seemed a nice extra to include.
Added brief documentation for `AggregateNodeOptions` that explains how the `keys` attribute affects how the `aggregates` attribute is used (or rather, how inputs are delegated to those aggregates).
Should be fixed. I guess a merge commit was lost somewhere along the line and git(hub) got confused.
/// If the keys attribute is a non-empty vector, then each aggregate in `aggregates` is
/// expected to be a HashAggregate function. If the keys attribute is an empty vector,
/// then each aggregate is assumed to be a ScalarAggregate function.
✔️ thank you!
auto input = MakeGroupableBatches(/*multiplicity=*/parallel ? 100 : 1);
auto minmax_opts = std::make_shared<ScalarAggregateOptions>();
auto expected_value = StructScalar::Make(
nit, but doesn't ScalarFromJSON handle StructScalar directly?
I tried looking at the function and searching for usages, but I couldn't figure it out. If you know how to do it, I can update it. I wasn't sure if a StructScalar is essentially an object (e.g. `{-8, 12}` would be a 2-field struct).
`arrow/cpp/src/arrow/ipc/json_simple.cc`, lines 659 to 661 in 8042f00:
// Append a JSON value that is either an array of N elements in order
// or an object mapping struct names to values (omitted struct members
// are mapped to null).
It should be `{"min": -8, "max": 12}`.
Note the R lints: https://github.com/apache/arrow/runs/7253994871?check_suite_focus=true
Added a comment. Co-authored-by: David Li <[email protected]>
Thanks. I tried to match style because I had trouble running the linter. I'll fix these and try to get it running.
Used `ScalarFromJSON` to construct the StructScalar instead of `StructScalar::Make`; also fixed style using clang-format.
Simplified the test body by passing `chunk_size` to `write_parquet` and using `replicate` instead of `sapply`; added other style changes for readability. Co-authored-by: Neal Richardson <[email protected]>
…13518) This updates the Scanner node such that it will use the guarantee expression to fill out columns missing from the dataset but guaranteed to be some constant with appropriate scalars, rather than just inserting a null placeholder column. In case both are available, the dataset constructor prefers using the scalar from the guarantee expression over the actual data, since the latter would probably be an array that unnecessarily repeats the constant value. This is the other part of what was uncovered while analyzing ARROW-16700, the more direct cause being a duplicate of ARROW-16904 (see also #13509 for my fix for that). Lead-authored-by: Jeroen van Straten <[email protected]> Co-authored-by: Aldrin M <[email protected]> Co-authored-by: octalene <[email protected]> Co-authored-by: Aldrin Montana <[email protected]> Co-authored-by: Weston Pace <[email protected]> Signed-off-by: Weston Pace <[email protected]>
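A hedged sketch of the idea in that change, using toy types rather than the actual Scanner code (the function and type names here are invented for illustration): when the fragment's guarantee expression pins a missing column to a constant, materialize that constant instead of a null placeholder.

```cpp
#include <map>
#include <optional>
#include <string>
#include <vector>

// Toy stand-in for a guarantee expression: column name -> guaranteed constant.
using Guarantee = std::map<std::string, int>;

// Fill a column that is absent from the dataset: use the guaranteed
// constant if one exists, otherwise fall back to a null placeholder.
std::vector<std::optional<int>> MaterializeMissingColumn(
    const std::string& name, std::size_t length, const Guarantee& guarantee) {
  if (auto it = guarantee.find(name); it != guarantee.end()) {
    return std::vector<std::optional<int>>(length, it->second);  // constant fill
  }
  return std::vector<std::optional<int>>(length, std::nullopt);  // null placeholder
}
```

This also motivates preferring the guarantee's scalar over actual data when both exist: repeating the same constant value in an array wastes memory compared to carrying it as a scalar.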
The min/max aggregate compute kernels seemed to discard their state between partitions, so they would only aggregate the last partition they see (in each thread).
This is the simplest change I could come up with to fix this, but honestly I'm not sure why the `local` variable even exists. It seems to me it could just be replaced with `this->state` directly, since there doesn't seem to be any failure path where `this->state` isn't updated from `local`. Am I missing something?
ETA: I tried to make a test case for this, only to find that there is already a test case for this. In that case, however, it seems that the merging of the partition results is done by `Merge`'ing the result of separate `Consume` calls, rather than chaining multiple `Consume` calls. I'm not sure how to trigger the latter behavior from a normal C++ test case.
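A toy reproduction of the behavior described above (illustrative code, not Arrow's real kernel): `local` starts from a default state on every `Consume` call and then overwrites the shared state, so chained `Consume` calls keep only the last batch, while `Merge`-ing separately consumed states still gives the right answer. That asymmetry is why a `Merge`-based test would not catch the bug.

```cpp
#include <algorithm>
#include <climits>
#include <vector>

struct State {
  int min = INT_MAX;
  int max = INT_MIN;
};

struct BuggyKernel {
  State state;

  void Consume(const std::vector<int>& batch) {
    State local;  // BUG: fresh local state, earlier batches are forgotten
    for (int v : batch) {
      local.min = std::min(local.min, v);
      local.max = std::max(local.max, v);
    }
    state = local;  // overwrite: only the last Consume call survives
  }

  void Merge(const State& other) {
    state.min = std::min(state.min, other.min);
    state.max = std::max(state.max, other.max);
  }
};
```

Chaining `Consume({1, 2})` then `Consume({3, 4})` on one kernel yields min 3 (the 1 is lost), whereas consuming the same batches on two kernels and merging yields the correct min of 1.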