ARROW-16700: [C++][R][Datasets] aggregates on partitioning columns #13518

jvanstraten · 2022-07-05T18:33:56Z

This updates the Scanner node such that it will use the guarantee expression to fill out columns missing from the dataset but guaranteed to be some constant with appropriate scalars, rather than just inserting a null placeholder column. In case both are available, the dataset constructor prefers using the scalar from the guarantee expression over the actual data, since the latter would probably be an array that unnecessarily repeats the constant value.

This is the other part of what was uncovered while analyzing ARROW-16700, the more direct cause being a duplicate of ARROW-16904 (see also #13509 for my fix for that).

github-actions · 2022-07-05T18:34:16Z

https://issues.apache.org/jira/browse/ARROW-16700

westonpace

This looks good. One suggestion for the test.

cpp/src/arrow/dataset/scanner_test.cc

drin · 2022-07-14T17:11:07Z

rebased branch

The major part of this commit is to add columns to ExecBatches from the guarantees vector instead of taking a lambda. The lambda can be repetitive and a source of error. This still only adds the guarantees to ExecBatches and not the Dataset

whitespace for test data was demolished by clang-format, this just restores the whitespace that made it human readable

drin · 2022-07-15T23:07:00Z

I confirmed that the C++ unit tests validate the fix provided in this PR. Still working on an R test that correctly validates the fix.

arrow/cpp/src/arrow/dataset/scanner.cc

Lines 911 to 916 in 6592ad2

    
           auto guarantee = partial.fragment.value->partition_expression(); 
        
           ARROW_ASSIGN_OR_RAISE( 
        
               util::optional<compute::ExecBatch> batch, 
        
               compute::MakeExecBatch(*scan_options->dataset_schema, 
        
                                      partial.record_batch.value, guarantee));

westonpace · 2022-07-18T14:42:01Z

I'm sorry but I believe I have sent you on a wild goose chase. I had suggested an R test but, upon re-reading the JIRA, realize that I had forgotten this is not reproducible from R (because it always inserts a project node prior to an aggregate node) and so no R test is needed.

westonpace

A few minor suggestions on the unit test and I think this is good to go.

cpp/src/arrow/dataset/scanner_test.cc

minor style change Co-authored-by: Weston Pace <[email protected]>

changing use of ARROW_WARN_NOT_OK A more appropriate macro was suggested, EXPECT_OK_AND_ASSIGN Co-authored-by: Weston Pace <[email protected]>

changed variable name and method invocation because EXPECT_OK_AND_ASSIGN unwraps the result whereas ARROW_RETURN_NOT_OK did not

westonpace

Thanks both of you for taking care of this.

cpp/src/arrow/dataset/scanner_test.cc

ursabot · 2022-07-23T18:41:21Z

Benchmark runs are scheduled for baseline = b3ce0fa and contender = e0ccfa1. e0ccfa1 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Failed] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.59% ⬆️0.0%] test-mac-arm
[Failed ⬇️0.63% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.29% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Failed] e0ccfa11 ec2-t3-xlarge-us-east-2
[Failed] e0ccfa11 test-mac-arm
[Failed] e0ccfa11 ursa-i9-9960x
[Finished] e0ccfa11 ursa-thinkcentre-m75q
[Failed] b3ce0fa7 ec2-t3-xlarge-us-east-2
[Failed] b3ce0fa7 test-mac-arm
[Failed] b3ce0fa7 ursa-i9-9960x
[Finished] b3ce0fa7 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

…Hub issue numbers (#34260) Rewrite the Jira issue numbers to the GitHub issue numbers, so that the GitHub issue numbers are automatically linked to the issues by pkgdown's auto-linking feature. Issue numbers have been rewritten based on the following correspondence. Also, the pkgdown settings have been changed and updated to link to GitHub. I generated the Changelog page using the `pkgdown::build_news()` function and verified that the links work correctly. --- ARROW-6338 #5198 ARROW-6364 #5201 ARROW-6323 #5169 ARROW-6278 #5141 ARROW-6360 #5329 ARROW-6533 #5450 ARROW-6348 #5223 ARROW-6337 #5399 ARROW-10850 #9128 ARROW-10624 #9092 ARROW-10386 #8549 ARROW-6994 #23308 ARROW-12774 #10320 ARROW-12670 #10287 ARROW-16828 #13484 ARROW-14989 #13482 ARROW-16977 #13514 ARROW-13404 #10999 ARROW-16887 #13601 ARROW-15906 #13206 ARROW-15280 #13171 ARROW-16144 #13183 ARROW-16511 #13105 ARROW-16085 #13088 ARROW-16715 #13555 ARROW-16268 #13550 ARROW-16700 #13518 ARROW-16807 #13583 ARROW-16871 #13517 ARROW-16415 #13190 ARROW-14821 #12154 ARROW-16439 #13174 ARROW-16394 #13118 ARROW-16516 #13163 ARROW-16395 #13627 ARROW-14848 #12589 ARROW-16407 #13196 ARROW-16653 #13506 ARROW-14575 #13160 ARROW-15271 #13170 ARROW-16703 #13650 ARROW-16444 #13397 ARROW-15016 #13541 ARROW-16776 #13563 ARROW-15622 #13090 ARROW-18131 #14484 ARROW-18305 #14581 ARROW-18285 #14615 * Closes: #33631 Authored-by: SHIMA Tatsuya <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>

github-actions bot added the Component: C++ label Jul 5, 2022

westonpace requested changes Jul 13, 2022

View reviewed changes

cpp/src/arrow/dataset/scanner_test.cc Outdated Show resolved Hide resolved

Materialize constant columns by means of guarantee expr in Scanner

6592ad2

drin force-pushed the ARROW-16700-aggregates-on-partitioning-columns-2 branch from 706f409 to 6592ad2 Compare July 14, 2022 17:10

drin added 2 commits July 15, 2022 15:44

fixing whitespace

f8ff276

whitespace for test data was demolished by clang-format, this just restores the whitespace that made it human readable

attempt to style fix

56fd917

westonpace requested changes Jul 18, 2022

View reviewed changes

cpp/src/arrow/dataset/scanner_test.cc Outdated Show resolved Hide resolved

cpp/src/arrow/dataset/scanner_test.cc Outdated Show resolved Hide resolved

drin and others added 4 commits July 18, 2022 10:33

Update cpp/src/arrow/dataset/scanner_test.cc

8aad784

minor style change Co-authored-by: Weston Pace <[email protected]>

Update cpp/src/arrow/dataset/scanner_test.cc

a4e685a

changing use of ARROW_WARN_NOT_OK A more appropriate macro was suggested, EXPECT_OK_AND_ASSIGN Co-authored-by: Weston Pace <[email protected]>

[ARROW-16700]: new macro has different return type

23163b8

changed variable name and method invocation because EXPECT_OK_AND_ASSIGN unwraps the result whereas ARROW_RETURN_NOT_OK did not

[ARROW-16700]: fixed style

b38719b

westonpace approved these changes Jul 22, 2022

View reviewed changes

cpp/src/arrow/dataset/scanner_test.cc Outdated Show resolved Hide resolved

Update cpp/src/arrow/dataset/scanner_test.cc

5ff90cc

westonpace merged commit e0ccfa1 into apache:master Jul 22, 2022

drin deleted the ARROW-16700-aggregates-on-partitioning-columns-2 branch July 22, 2022 18:20

eitsupi mentioned this pull request Feb 19, 2023

GH-33631: [R] Rewrite Jira ticket numbers in pkgdown documents to GitHub issue numbers #34260

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

ARROW-16700: [C++][R][Datasets] aggregates on partitioning columns #13518

ARROW-16700: [C++][R][Datasets] aggregates on partitioning columns #13518

Uh oh!

jvanstraten commented Jul 5, 2022

Uh oh!

github-actions bot commented Jul 5, 2022

Uh oh!

westonpace left a comment

Uh oh!

Uh oh!

drin commented Jul 14, 2022

Uh oh!

drin commented Jul 15, 2022

Uh oh!

westonpace commented Jul 18, 2022

Uh oh!

westonpace left a comment

Uh oh!

Uh oh!

Uh oh!

westonpace left a comment

Uh oh!

Uh oh!

ursabot commented Jul 23, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

ARROW-16700: [C++][R][Datasets] aggregates on partitioning columns #13518

ARROW-16700: [C++][R][Datasets] aggregates on partitioning columns #13518

Uh oh!

Conversation

jvanstraten commented Jul 5, 2022

Uh oh!

github-actions bot commented Jul 5, 2022

Uh oh!

westonpace left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

drin commented Jul 14, 2022

Uh oh!

drin commented Jul 15, 2022

Uh oh!

westonpace commented Jul 18, 2022

Uh oh!

westonpace left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

westonpace left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ursabot commented Jul 23, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants