ARROW-15517: [R] Use WriteNode in write_dataset() #12316
Conversation
westonpace left a comment:
Looks good. Thanks for including this. I hadn't realized we were roundtripping to R in this case.
Force-pushed from db8f4b9 to 24bf41c.
Review comment on r/R/dataset-write.R (outdated):
This condition probably still needs a test or two.
Force-pushed from c992b33 to 51c5b0b.
The last failing test was that row-level metadata isn't included when you collect() a dataset, with a warning: https://github.com/apache/arrow/blob/master/r/tests/testthat/test-metadata.R#L274-L277

What was actually happening is that no KeyValueMetadata was being preserved on write in this PR; we just don't have any tests currently around metadata in write_dataset(), only this one about row-level metadata (and in this PR, neither row-level nor any other kind of metadata was present). I pushed a fix that grabs the KeyValueMetadata from the source data object in the query and uses that, which fixes the failing tests, but I'm not sure that's totally correct. If you have a join or aggregation, it won't make sense; we have some logic in do_exec_plan() to handle some of this. Should we factor that out and apply it in write_dataset()?

I'm also more than happy to defer handling this, since (afaict) this PR is consistent with the status quo, and we don't have any assertions (that I can find) about the behavior of general metadata on dataset write, nor any handling for amalgamating metadata when joining, etc. Your call @jonkeane.
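For illustration, the shape of that fix is roughly the following. This is a sketch, not the PR's actual code: the helper name is invented, and it assumes the caller can hand over the query's source data and that a Schema's $metadata field can be assigned.

```r
# Illustrative sketch only: copy key-value metadata from the query's source
# onto the schema used for the write. `carry_over_metadata` is a hypothetical
# name; assigning to `$metadata` assumes the Schema metadata field is settable.
carry_over_metadata <- function(source_data, write_schema) {
  kvm <- source_data$schema$metadata
  if (length(kvm) > 0) {
    write_schema$metadata <- kvm
  }
  write_schema
}
```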
I took a read through and this looks good so far.
If it's not too much, I would say we should factor out what's in do_exec_plan(). The only other question I have: https://github.com/apache/arrow/pull/12316/files#r850699081 did you add tests for these? I didn't see them, but maybe I'm missing something!
This reverts commit afb6ca3453057f55fa8ba23433fce06949badf99.
Force-pushed from f190028 to 4732418.
I factored out the metadata helper from do_exec_plan(). No tests yet that it does what is expected on dataset write, but the existing metadata tests pass. I also have not yet added that other test for the topk or sorted dataset write; I didn't muster enough brain cells today. Topk seems like a simple test, but I'm not sure about sorting: I don't know what guarantees there are, or should be, around sorting in the files that write_dataset() produces.
Nods, thanks! I can take over adding the tests (tomorrow) if you would like.
westonpace left a comment:
This looks right to me. Added my thoughts on the metadata issue.
validate_positive_int_value(max_partitions)
validate_positive_int_value(max_open_files)
validate_positive_int_value(min_rows_per_group)
validate_positive_int_value(max_rows_per_group)
The error message says non-missing, and yet we have defaults for all of these properties (and line 196 seems to tolerate a missing max_rows_per_group). Are they truly required to be non-missing?
I believe it means "not NA" rather than "omitted", judging from the actual validation it is doing.
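For illustration, a validator along these lines plausibly checks something like the following. This is a sketch, not the actual helper from the PR; the error message text is assumed.

```r
# Hypothetical sketch: reject NA, non-numeric, and non-positive values.
# Arguments simply left at their defaults still pass this check, which is
# why "non-missing" here means "not NA" rather than "not omitted".
validate_positive_int_value <- function(value) {
  if (!is.numeric(value) || is.na(value) || value <= 0) {
    arg <- deparse(substitute(value))
    stop(arg, " must be a positive, non-missing integer", call. = FALSE)
  }
}

validate_positive_int_value(1024)            # passes
# validate_positive_int_value(NA_integer_)   # errors
```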
# TODO: do we care about other (non-R) metadata preservation?
# How would we know if it were meaningful?
I think it depends on the source of old_schema.
In the general case, the input is a collection of files and the output is a different set of files (sometimes we explode files and sometimes we merge files). The idea of writing metadata to the output files is somewhat meaningless, so in general I would say no, you don't care about preservation.
In Python, users can create a dataset from a single file, and we do a little bit of work to preserve the metadata on write because we want it to feel like it "round trips".
When creating or appending to a dataset, users might want to specify general information about how the files were created, like "Origin": "Nightly update", but that is unrelated to the original metadata.
In the future, the dataset write may append its own metadata (e.g. dataset statistics, or information about the dataset schema such as which columns are already sorted).
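For example, attaching that kind of user-supplied annotation before a write might look roughly like this in R. This is a sketch: the key name is invented, and assigning into $metadata assumes the Table metadata field is writable.

```r
library(arrow)

tbl <- arrow_table(mtcars)
# Invented key recording how these files were produced;
# assumes Table$metadata supports assignment.
tbl$metadata$Origin <- "Nightly update"

write_dataset(tbl, tempfile(), format = "parquet", partitioning = "cyl")
```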
@jonkeane I added a test for the topk dataset writing. Sorting seems like a separate beast and has a C++ issue already, so I think this is done.
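For reference, a test for that case might look roughly like the following sketch (not necessarily the test that was actually added; the paths and checked values are illustrative).

```r
library(arrow)
library(dplyr)

src_dir <- tempfile()
write_dataset(mtcars, src_dir, format = "parquet")

out_dir <- tempfile()
# "topk"-style query: order by a column, keep the first few rows, write them out.
open_dataset(src_dir) %>%
  arrange(desc(mpg)) %>%
  head(3) %>%
  write_dataset(out_dir, format = "parquet")

result <- open_dataset(out_dir) %>% collect()
stopifnot(
  nrow(result) == 3,
  setequal(result$mpg, sort(mtcars$mpg, decreasing = TRUE)[1:3])
)
```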
Benchmark runs are scheduled for baseline = 4544f95 and contender = 4b3f467. 4b3f467 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
This should allow streaming writes in more cases, e.g. with a join.
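For instance, the result of a join can now feed the write directly instead of being collected into R first. A small sketch (table contents and paths are invented):

```r
library(arrow)
library(dplyr)

orders <- arrow_table(data.frame(id = c(1L, 2L, 3L), amount = c(10, 20, 30)))
customers <- arrow_table(data.frame(id = c(1L, 2L, 3L), name = c("a", "b", "c")))

# The joined query is handed to write_dataset() and streamed to disk.
orders %>%
  left_join(customers, by = "id") %>%
  write_dataset(tempfile(), format = "parquet")
```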