Skip to content

Conversation

@jvanstraten
Copy link
Contributor

This updates the Scanner node such that it will use the guarantee expression to fill out columns missing from the dataset but guaranteed to be some constant with appropriate scalars, rather than just inserting a null placeholder column. In case both are available, the dataset constructor prefers using the scalar from the guarantee expression over the actual data, since the latter would probably be an array that unnecessarily repeats the constant value.

This is the other part of what was uncovered while analyzing ARROW-16700, the more direct cause being a duplicate of ARROW-16904 (see also #13509 for my fix for that).

@github-actions
Copy link

github-actions bot commented Jul 5, 2022

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good. One suggestion for the test.

@drin drin force-pushed the ARROW-16700-aggregates-on-partitioning-columns-2 branch from 706f409 to 6592ad2 Compare July 14, 2022 17:10
@drin
Copy link
Contributor

drin commented Jul 14, 2022

rebased branch

drin added 2 commits July 15, 2022 15:44
The major part of this commit is to add columns to ExecBatches from the
guarantees vector instead of taking a lambda. The lambda can be
repetitive and a source of error. This still only adds the guarantees to
ExecBatches and not the Dataset
whitespace for test data was demolished by clang-format, this just
restores the whitespace that made it human readable
@drin
Copy link
Contributor

drin commented Jul 15, 2022

I confirmed that the C++ unit tests validate the fix provided in this PR. Still working on an R test that correctly validates the fix.

auto guarantee = partial.fragment.value->partition_expression();
ARROW_ASSIGN_OR_RAISE(
util::optional<compute::ExecBatch> batch,
compute::MakeExecBatch(*scan_options->dataset_schema,
partial.record_batch.value, guarantee));

@westonpace
Copy link
Member

I'm sorry but I believe I have sent you on a wild goose chase. I had suggested an R test but, upon re-reading the JIRA, realize that I had forgotten this is not reproducible from R (because it always inserts a project node prior to an aggregate node) and so no R test is needed.

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor suggestions on the unit test and I think this is good to go.

drin and others added 4 commits July 18, 2022 10:33
minor style change

Co-authored-by: Weston Pace <[email protected]>
changing use of ARROW_WARN_NOT_OK

A more appropriate macro was suggested, EXPECT_OK_AND_ASSIGN

Co-authored-by: Weston Pace <[email protected]>
changed variable name and method invocation because EXPECT_OK_AND_ASSIGN
unwraps the result whereas ARROW_RETURN_NOT_OK did not
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks both of you for taking care of this.

@westonpace westonpace merged commit e0ccfa1 into apache:master Jul 22, 2022
@drin drin deleted the ARROW-16700-aggregates-on-partitioning-columns-2 branch July 22, 2022 18:20
@ursabot
Copy link

ursabot commented Jul 23, 2022

Benchmark runs are scheduled for baseline = b3ce0fa and contender = e0ccfa1. e0ccfa1 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Failed] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.59% ⬆️0.0%] test-mac-arm
[Failed ⬇️0.63% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.29% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Failed] e0ccfa11 ec2-t3-xlarge-us-east-2
[Failed] e0ccfa11 test-mac-arm
[Failed] e0ccfa11 ursa-i9-9960x
[Finished] e0ccfa11 ursa-thinkcentre-m75q
[Failed] b3ce0fa7 ec2-t3-xlarge-us-east-2
[Failed] b3ce0fa7 test-mac-arm
[Failed] b3ce0fa7 ursa-i9-9960x
[Finished] b3ce0fa7 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

kou pushed a commit that referenced this pull request Feb 20, 2023
…Hub issue numbers (#34260)

Rewrite the Jira issue numbers to the GitHub issue numbers, so that the GitHub issue numbers are automatically linked to the issues by pkgdown's auto-linking feature.

Issue numbers have been rewritten based on the following correspondence.
Also, the pkgdown settings have been changed and updated to link to GitHub.

I generated the Changelog page using the `pkgdown::build_news()` function and verified that the links work correctly.

---
ARROW-6338	#5198
ARROW-6364	#5201
ARROW-6323	#5169
ARROW-6278	#5141
ARROW-6360	#5329
ARROW-6533	#5450
ARROW-6348	#5223
ARROW-6337	#5399
ARROW-10850	#9128
ARROW-10624	#9092
ARROW-10386	#8549
ARROW-6994	#23308
ARROW-12774	#10320
ARROW-12670	#10287
ARROW-16828	#13484
ARROW-14989	#13482
ARROW-16977	#13514
ARROW-13404	#10999
ARROW-16887	#13601
ARROW-15906	#13206
ARROW-15280	#13171
ARROW-16144	#13183
ARROW-16511	#13105
ARROW-16085	#13088
ARROW-16715	#13555
ARROW-16268	#13550
ARROW-16700	#13518
ARROW-16807	#13583
ARROW-16871	#13517
ARROW-16415	#13190
ARROW-14821	#12154
ARROW-16439	#13174
ARROW-16394	#13118
ARROW-16516	#13163
ARROW-16395	#13627
ARROW-14848	#12589
ARROW-16407	#13196
ARROW-16653	#13506
ARROW-14575	#13160
ARROW-15271	#13170
ARROW-16703	#13650
ARROW-16444	#13397
ARROW-15016	#13541
ARROW-16776	#13563
ARROW-15622	#13090
ARROW-18131	#14484
ARROW-18305	#14581
ARROW-18285	#14615
* Closes: #33631

Authored-by: SHIMA Tatsuya <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants