ARROW-15517: [R] Use WriteNode in write_dataset() #12316
Conversation
westonpace left a comment:
Looks good. Thanks for including this. I hadn't realized we were roundtripping to R in this case.
Force-pushed from db8f4b9 to 24bf41c.
Review comment on r/R/dataset-write.R (outdated):
This condition probably still needs a test or two.
Force-pushed from c992b33 to 51c5b0b.
The last failing test was that row-level metadata isn't included when you collect() a dataset, with a warning: https://github.com/apache/arrow/blob/master/r/tests/testthat/test-metadata.R#L274-L277

What was actually happening is that no KeyValueMetadata was being preserved on write in this PR; we just don't have any tests currently around metadata in write_dataset(), only this one about row-level metadata (and in this PR, neither row-level nor any other kind of metadata was present). I pushed a fix that grabs the KeyValueMetadata from the source data object in the query and uses that, which fixes the failing tests, but I'm not sure that's totally correct. If you have a join or aggregation, it won't make sense; we have some logic in do_exec_plan() to handle some of this. Should we factor that out and apply it in write_dataset()?

I'm also more than happy to defer handling this, since (afaict) this PR is consistent with the status quo, and we don't have any assertions (that I can find) about the behavior of general metadata on dataset write, nor any handling for amalgamating metadata when joining, etc. Your call @jonkeane.
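For illustration, the shape of that fix is roughly the following. This is a sketch, not the PR's actual code: the helper name is invented, and it assumes the caller can hand over the query's source data and that a Schema's $metadata field can be assigned.

```r
# Illustrative sketch only: copy key-value metadata from the query's source
# onto the schema used for the write. `carry_over_metadata` is a hypothetical
# name; assigning to `$metadata` assumes the Schema metadata field is settable.
carry_over_metadata <- function(source_data, write_schema) {
  kvm <- source_data$schema$metadata
  if (length(kvm) > 0) {
    write_schema$metadata <- kvm
  }
  write_schema
}
```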
I took a read through and this looks good so far.
If it's not too much, I would say we should factor out what's in do_exec_plan(). The only other question I have: https://github.com/apache/arrow/pull/12316/files#r850699081 did you add tests for these? I didn't see them, but maybe I'm missing something!
This reverts commit afb6ca3453057f55fa8ba23433fce06949badf99.
Force-pushed from f190028 to 4732418.
I factored out the metadata helper from do_exec_plan(). No tests yet that it does what is expected on dataset write, but the existing metadata tests pass. I also have not yet added that other test for the topk or sorted dataset write; I didn't muster enough brain cells today. Topk seems like a simple test, but I'm not sure about sorting: I don't know what guarantees there are, or should be, around sorting in the files that write_dataset() produces.
Nods, thanks! I can take over adding the tests (tomorrow) if you would like.
westonpace left a comment:
This looks right to me. Added my thoughts on the metadata issue.
validate_positive_int_value(max_partitions)
validate_positive_int_value(max_open_files)
validate_positive_int_value(min_rows_per_group)
validate_positive_int_value(max_rows_per_group)
The error message says non-missing, and yet we have defaults for all of these properties (and line 196 seems to tolerate a missing max_rows_per_group). Are they truly required to be non-missing?
I believe it means "not NA" rather than "omitted", judging from the actual validation it is doing.
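For illustration, a validator along these lines plausibly checks something like the following. This is a sketch, not the actual helper from the PR; the error message text is assumed.

```r
# Hypothetical sketch: reject NA, non-numeric, and non-positive values.
# Arguments simply left at their defaults still pass this check, which is
# why "non-missing" here means "not NA" rather than "not omitted".
validate_positive_int_value <- function(value) {
  if (!is.numeric(value) || is.na(value) || value <= 0) {
    arg <- deparse(substitute(value))
    stop(arg, " must be a positive, non-missing integer", call. = FALSE)
  }
}

validate_positive_int_value(1024)            # passes
# validate_positive_int_value(NA_integer_)   # errors
```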
# TODO: do we care about other (non-R) metadata preservation?
# How would we know if it were meaningful?
I think it depends on the source of old_schema.
In the general case, the input is a collection of files and the output is a different set of files (sometimes we explode files and sometimes we merge files). The idea of writing metadata to the output files is somewhat meaningless, so in general I would say no, you don't care about preservation.
In Python, users can create a dataset from a single file, and we do a little bit of work to preserve the metadata on write because we want it to feel like it "round trips".
When creating or appending to a dataset, users might want to specify general information about how the files were created, like "Origin": "Nightly update", but that is unrelated to the original metadata.
In the future, the dataset write may append its own metadata (e.g. dataset statistics, or information about the dataset schema such as which columns are already sorted).
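For example, attaching that kind of user-supplied annotation before a write might look roughly like this in R. This is a sketch: the key name is invented, and assigning into $metadata assumes the Table metadata field is writable.

```r
library(arrow)

tbl <- arrow_table(mtcars)
# Invented key recording how these files were produced;
# assumes Table$metadata supports assignment.
tbl$metadata$Origin <- "Nightly update"

write_dataset(tbl, tempfile(), format = "parquet", partitioning = "cyl")
```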
@jonkeane I added a test for the topk dataset writing. Sorting seems like a separate beast and has a C++ issue already, so I think this is done.
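For reference, a test for that case might look roughly like the following sketch (not necessarily the test that was actually added; the paths and checked values are illustrative).

```r
library(arrow)
library(dplyr)

src_dir <- tempfile()
write_dataset(mtcars, src_dir, format = "parquet")

out_dir <- tempfile()
# "topk"-style query: order by a column, keep the first few rows, write them out.
open_dataset(src_dir) %>%
  arrange(desc(mpg)) %>%
  head(3) %>%
  write_dataset(out_dir, format = "parquet")

result <- open_dataset(out_dir) %>% collect()
stopifnot(
  nrow(result) == 3,
  setequal(result$mpg, sort(mtcars$mpg, decreasing = TRUE)[1:3])
)
```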
Benchmark runs are scheduled for baseline = 4544f95 and contender = 4b3f467. 4b3f467 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
This should allow streaming writes in more cases, e.g. with a join.
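For instance, the result of a join can now feed the write directly instead of being collected into R first. A small sketch (table contents and paths are invented):

```r
library(arrow)
library(dplyr)

orders <- arrow_table(data.frame(id = c(1L, 2L, 3L), amount = c(10, 20, 30)))
customers <- arrow_table(data.frame(id = c(1L, 2L, 3L), name = c("a", "b", "c")))

# The joined query is handed to write_dataset() and streamed to disk.
orders %>%
  left_join(customers, by = "id") %>%
  write_dataset(tempfile(), format = "parquet")
```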