
Conversation

@nealrichardson (Member) commented Feb 1, 2022

This should allow streaming writes in more cases, e.g. with a join.
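
For illustration, here is a minimal sketch of the kind of pipeline this enables: writing the result of a joined query directly with `write_dataset()` instead of collecting into R first (the data and paths here are made up for the example):

```r
library(arrow)
library(dplyr)

left <- Table$create(id = 1:3, x = c("a", "b", "c"))
right <- Table$create(id = 2:4, y = c(10, 20, 30))

out <- tempfile()

# The joined query is written out without materializing it in R
left %>%
  left_join(right, by = "id") %>%
  write_dataset(out, format = "parquet")

collect(open_dataset(out))
```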

@westonpace (Member) left a comment

Looks good. Thanks for including this. I hadn't realized we were roundtripping to R in this case.

@nealrichardson (Member, Author) replied:

This condition probably still needs a test or two

@nealrichardson marked this pull request as ready for review April 14, 2022 19:07
@nealrichardson (Member, Author) commented Apr 15, 2022

The last failing test was that row-level metadata isn't included when you collect() a dataset, with a warning: https://github.com/apache/arrow/blob/master/r/tests/testthat/test-metadata.R#L274-L277

What was actually happening is that no KeyValueMetadata was being preserved on write in this PR; we just don't have any tests currently around metadata in write_dataset, only this one about row-level metadata (and with this PR, neither row-level nor any other kind of metadata was present).

I pushed a fix that grabs the KVM from the source data object in the query and uses that, and that fixes the failing tests, but I'm not sure it's totally correct. If you have a join or aggregation, it won't make sense; we have some logic in do_exec_plan() to handle some of this. Should we factor that out and apply it in write_dataset()? Also, I'm more than happy to defer handling this since (AFAICT) this PR is consistent with the status quo, and we don't have any assertions (that I can find) about the behavior of general metadata on dataset write, nor any handling for amalgamating metadata when joining, etc. Your call @jonkeane.

@jonkeane (Member) commented:
I took a read through and this looks good thus far.

> I pushed a fix that grabs the KVM from the source data object in the query and uses that, and that fixes the failing tests, but I'm not sure it's totally correct. If you have a join or aggregation, it won't make sense; we have some logic in do_exec_plan() to handle some of this. Should we factor that out and apply it in write_dataset()? Also, I'm more than happy to defer handling this since (AFAICT) this PR is consistent with the status quo, and we don't have any assertions (that I can find) about the behavior of general metadata on dataset write, nor any handling for amalgamating metadata when joining, etc. Your call @jonkeane.

If it's not too much, I would say we should factor out what's in do_exec_plan and at least do that much; we can (and should) defer the question of how to amalgamate metadata when you join (and AFAICT the code in do_exec_plan handles the aggregation case). We have had a few issues and Twitter threads about metadata + datasets (at least in the cases where it failed with row-level metadata!) where folks expected that it would just work. And of course, we should also have tests to assert that's what we intend, though we can do that as a follow-on as well.

The only other question I have is about https://github.com/apache/arrow/pull/12316/files#r850699081: did you add tests for these? I didn't see them, but maybe I'm missing something!

@nealrichardson (Member, Author) commented:

I factored out the metadata helper from do_exec_plan. No tests yet that it does what is expected on dataset write, but the existing metadata tests pass. I also have not yet added that other test for the topk or sorted dataset write; I did not muster enough brain cells today. Topk seems like a simple test, but I'm not sure about sorting: IDK what guarantees there are, or should be, around sorting in the files that write_dataset produces.
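
For reference, a rough sketch of the shape of that refactor: pull the metadata logic out of `do_exec_plan()` into a helper that `write_dataset()` can also call. The helper name and accessors below are hypothetical, not the actual arrow internals:

```r
# Hypothetical helper: collect R metadata (KeyValueMetadata) from the
# query's source, but only when it is meaningful to carry it through.
get_all_metadata <- function(.data) {
  source <- source_data(.data)  # hypothetical accessor for the query's input
  if (is.null(source) || has_join_or_aggregation(.data)) {  # also hypothetical
    # With a join or aggregation there is no single source whose
    # metadata should win, so drop it rather than guess.
    return(NULL)
  }
  source$schema$metadata
}

# Both do_exec_plan() and write_dataset() would then call this helper
# and attach the result to the output schema.
```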

@jonkeane (Member) commented:

*nods* Thanks! I can take over adding the tests (tomorrow) if you would like.

@westonpace (Member) left a comment

This looks right to me. Added my thoughts on the metadata issue.

Comment on lines +200 to +203
validate_positive_int_value(max_partitions)
validate_positive_int_value(max_open_files)
validate_positive_int_value(min_rows_per_group)
validate_positive_int_value(max_rows_per_group)
@westonpace (Member):

The error message says non-missing, and yet we have defaults for all of these properties (and line 196 seems to tolerate a missing max_rows_per_group). Are they truly required to be non-missing?

@nealrichardson (Member, Author) replied:

I believe it means "not NA" rather than "omitted", judging from the actual validation it is doing.
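
To make that distinction concrete, here is a hypothetical sketch of a validator with those semantics: NULL (omitted, use the default) passes, but NA or a non-positive value fails. The real helper in the package may differ:

```r
validate_positive_int_value <- function(value) {
  # NULL means "use the default", so it is tolerated; NA is not.
  if (!is.null(value) && (is.na(value) || !is.numeric(value) || value <= 0)) {
    stop(
      deparse(substitute(value)), " must be a positive, non-missing integer",
      call. = FALSE
    )
  }
}
```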

Comment on lines +214 to +215
# TODO: do we care about other (non-R) metadata preservation?
# How would we know if it were meaningful?
@westonpace (Member):

I think it depends on the source of old_schema.

In the general case, the input is a collection of files and the output is a different set of files (sometimes we explode files and sometimes we merge files). The idea of writing metadata to the output files is somewhat meaningless. So, in general, I would say no, you don't care about preservation.

In Python, users can create a dataset from a single file, and we do a little bit of work to preserve the metadata on write because we want it to feel like it "round trips".

When creating or appending to a dataset, users might want to specify general information about how the files were created, like "Origin": "Nightly update", but that is unrelated to the original metadata.

In the future, the dataset write may append its own metadata (e.g. dataset statistics, or information about the dataset schema, such as which columns are already sorted).
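
As a concrete illustration of what "round trips" means here, this is roughly what a user-facing check could look like in R; whether `write_dataset()` actually preserves the metadata is exactly what this thread is deciding (example data is made up):

```r
library(arrow)

tbl <- Table$create(x = 1:3)
tbl$metadata <- list(origin = "nightly update")

tmp <- tempfile()
write_dataset(tbl, tmp, format = "parquet")

# If metadata round-trips, this should still show origin = "nightly update"
open_dataset(tmp)$schema$metadata
```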

@nealrichardson (Member, Author) commented:

@jonkeane I added a test for the topk dataset writing. Sorting seems like a separate beast and has a C++ issue already. So I think this is done.
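
For reference, a test along those lines might look something like this (a sketch, not the actual test added in the PR):

```r
library(arrow)
library(dplyr)

src <- tempfile()
dst <- tempfile()
write_dataset(mtcars, src, format = "parquet")

# Top-k: sort, take the first k rows, and write without collecting into R
open_dataset(src) %>%
  arrange(desc(mpg)) %>%
  head(3) %>%
  write_dataset(dst, format = "parquet")

stopifnot(nrow(collect(open_dataset(dst))) == 3)
```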

@jonkeane closed this in 4b3f467 Apr 19, 2022
@nealrichardson deleted the write-node branch April 20, 2022 00:07
@ursabot commented Apr 21, 2022

Benchmark runs are scheduled for baseline = 4544f95 and contender = 4b3f467. 4b3f467 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.08% ⬆️0.0%] test-mac-arm
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.21% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 4b3f4677 ec2-t3-xlarge-us-east-2: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/545
[Finished] 4b3f4677 test-mac-arm: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/532
[Failed] 4b3f4677 ursa-i9-9960x: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/531
[Finished] 4b3f4677 ursa-thinkcentre-m75q: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/542
[Finished] 4544f953 ec2-t3-xlarge-us-east-2: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/544
[Finished] 4544f953 test-mac-arm: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/531
[Failed] 4544f953 ursa-i9-9960x: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/529
[Finished] 4544f953 ursa-thinkcentre-m75q: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/541
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
