ARROW-6532 [R] write_parquet() uses writer properties (general and arrow specific) #5451
Conversation
Yeah, I'm not sure about this on a few levels. I think the R I want to type to write a compressed Parquet file looks like […]. I'm also not sure whether this works as intended: the Parquet C++ code seems to have its own compression and writing logic; that may be a historical artifact, or it may be meaningful. Maybe we can get away without implementing bindings for those classes -- the proof would be a passing test of writing a compressed Parquet file and reading it back in. Then again, maybe in principle we should write the Parquet bindings to match the C++ library.
It is not a good idea to write a Parquet file into a CompressedOutputStream. Such a file will not be readable, since Parquet already compresses data internally.
Here's the way we handle it in Python; you'll need to do the same thing in R: https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L363
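For context, the linked Python code accepts either a single codec name or a per-column mapping and normalizes it before building the writer properties. Below is a minimal sketch of the analogous normalization in R -- `normalize_compression` is a hypothetical helper for illustration, not code from this PR:

```r
normalize_compression <- function(compression, column_names) {
  if (is.null(compression)) {
    return(NULL)
  }
  if (is.character(compression) && length(compression) == 1 && is.null(names(compression))) {
    # a single codec name applies to every column
    return(stats::setNames(rep(compression, length(column_names)), column_names))
  }
  # a named list/vector gives a codec per column, e.g. list(x1 = "snappy")
  spec <- unlist(compression)
  stopifnot(all(names(spec) %in% column_names))
  spec
}

normalize_compression("snappy", c("x1", "x2", "y"))
#>       x1       x2        y
#> "snappy" "snappy" "snappy"
```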
Force-pushed from d85b6fc to aa2833e
Some progress, inspired from the Python implementation:

```r
write_parquet <- function(
  table,
  sink, chunk_size = NULL,
  version = NULL, compression = NULL, use_dictionary = NULL, write_statistics = NULL, data_page_size = NULL,
  properties = ParquetWriterProperties$create(
    version = version,
    compression = compression,
    use_dictionary = use_dictionary,
    write_statistics = write_statistics,
    data_page_size = data_page_size
  ),
  use_deprecated_int96_timestamps = FALSE, coerce_timestamps = NULL, allow_truncated_timestamps = FALSE,
  arrow_properties = ParquetArrowWriterProperties$create(
    use_deprecated_int96_timestamps = use_deprecated_int96_timestamps,
    coerce_timestamps = coerce_timestamps,
    allow_truncated_timestamps = allow_truncated_timestamps
  )
)
```

These arguments are managed by the classes `ParquetWriterProperties` and `ParquetArrowWriterProperties`. Only simple versions work so far, e.g.

```r
library(arrow, warn.conflicts = FALSE)
df <- tibble::tibble(x = 1:5)
write_parquet(df, "/tmp/test.parquet", compression = "snappy")
read_parquet("/tmp/test.parquet")
#> # A tibble: 5 x 1
#>       x
#>   <int>
#> 1     1
#> 2     2
#> 3     3
#> 4     4
#> 5     5
```

but we can't, e.g., specify which variables get which compression. This seems a good place for a tidy select, e.g. something like:

```r
df <- tibble::tibble(x1 = 1:5, x2 = 1:5, y = 1:5)
write_parquet(df, "/tmp/test.parquet",
  compression = list(snappy = starts_with("x"))
)
```

The list in Python goes the other way, so if we do something similar it would look like

```r
write_parquet(df, "/tmp/test.parquet",
  compression = list(x1 = "snappy", x2 = "snappy")
)
```

Perhaps we can have

```r
write_parquet(df, "/tmp/test.parquet",
  compression = compression_spec(snappy = starts_with("x"))
)
```
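A hedged sketch of what that `compression_spec()` helper could look like -- it does not exist in arrow; this only illustrates resolving a tidyselect-style spec into the Python-style per-column list:

```r
# Hypothetical helper, not part of arrow: turns
# compression_spec(snappy = starts_with("x")) into list(x1 = "snappy", ...)
library(tidyselect)

compression_spec <- function(...) {
  quos <- rlang::enquos(...)
  function(data) {
    out <- character(0)
    for (i in seq_along(quos)) {
      # resolve the tidyselect expression against the data's columns
      cols <- names(eval_select(quos[[i]], data))
      out[cols] <- names(quos)[i]  # the codec name is the argument name
    }
    as.list(out)
  }
}

df <- tibble::tibble(x1 = 1:5, x2 = 1:5, y = 1:5)
compression_spec(snappy = starts_with("x"))(df)
#> $x1
#> [1] "snappy"
#>
#> $x2
#> [1] "snappy"
```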
One option we discussed with @nealrichardson was to be able to do e.g.

```r
write_parquet(df, "/tmp/test.parquet",
  compression = Codec$create("snappy", 5L)
)
```

But unfortunately, the C++ class […]. Instead, I followed Python's lead and we can do this:

```r
write_parquet(df, "/tmp/test.parquet",
  compression = "snappy",
  compression_level = 5L
)
```
These arguments that are handled by […]
Taking a look now; FTR Travis says […]
Codecov Report

```
@@            Coverage Diff             @@
##           master    #5451       +/-   ##
===========================================
- Coverage    88.7%   76.76%    -11.94%
===========================================
  Files         964       59       -905
  Lines      128215     4330    -123885
  Branches     1501        0      -1501
===========================================
- Hits       113731     3324    -110407
+ Misses      14119     1006     -13113
+ Partials      365        0       -365
```
nealrichardson left a comment:
A few more notes. I'd also like to see better coverage on https://codecov.io/gh/apache/arrow/pull/5451/diff
r/R/parquet.R (outdated)
I'd write this function as

```r
make_valid_version <- function(version, valid_versions = valid_parquet_version) {
  pq_version <- valid_versions[[version]]
  if (is.null(pq_version)) {
    stop('"version" should be one of ', oxford_paste(names(valid_versions), "or"), call. = FALSE)
  }
  pq_version
}
```
As it stands, make_valid_version(1) won't work, and it seems like it should.
Per the codecov report, this code isn't being exercised.
wfm:

```r
arrow:::make_valid_version("1.0")
#> [1] 0
arrow:::make_valid_version("2.0")
#> [1] 1
arrow:::make_valid_version(1)
#> [1] 0
arrow:::make_valid_version(2)
#> [1] 1
```

Created on 2019-09-27 by the reprex package (v0.3.0.9000)
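A minimal testthat sketch of the kind of coverage the review asks for -- these exact tests are not from the PR, just an illustration based on the reprex above:

```r
library(testthat)

test_that("make_valid_version() accepts character and numeric input", {
  # "1.0" and 1 should map to the same internal enum value
  expect_equal(arrow:::make_valid_version("1.0"), arrow:::make_valid_version(1))
  expect_equal(arrow:::make_valid_version("2.0"), arrow:::make_valid_version(2))
  # an unknown version should raise an error
  expect_error(arrow:::make_valid_version("99"))
})
```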
r/R/parquet.R (outdated)
I'm not sure there's value in including properties and arrow_properties in the signature here. I kept them in read_delim_arrow() because there were some properties they expose that aren't mapped to arguments in the readr::read_delim signature, but that doesn't seem to be the case here. (On reflection, that's probably not the right call there either; if you want lower-level access to those settings, you should probably be doing CsvTableReader$create(...) anyway.)
My rationale was that perhaps you'd already have built those properties and arrow_properties objects before.
But I get the point that maybe this could be diverted to using a ParquetFileWriter instance.
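A sketch of what that lower-level path might look like -- the `ParquetFileWriter` usage below is an assumption about the eventual binding, not an API confirmed in this thread:

```r
# Hypothetical lower-level usage, for illustration only: construct the
# writer directly instead of going through write_parquet().
library(arrow, warn.conflicts = FALSE)

tab <- Table$create(tibble::tibble(x = 1:5))
sink <- FileOutputStream$create("/tmp/test.parquet")
writer <- ParquetFileWriter$create(
  schema = tab$schema,
  sink = sink,
  properties = ParquetWriterProperties$create(compression = "snappy")
)
writer$WriteTable(tab, chunk_size = 10000L)
writer$Close()
```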
nealrichardson left a comment:
Some notes on the docs
r/R/parquet.R (outdated)
And not all columns need to be specified, correct?
Updated.
r/R/parquet.R (outdated)
Does this follow the same conventions as compression? Maybe there should be a paragraph/section in the docs that explains how these parameters work since it's the same/similar.
I've refactored the documentation in @details
r/R/parquet.R (outdated)
Same, and what are statistics?
I don't know
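For background (not from the thread): Parquet statistics are per-column-chunk min/max and null-count metadata that query engines can use to skip row groups. The `write_statistics` argument toggles writing them; a small illustration, with the file path as a placeholder:

```r
library(arrow, warn.conflicts = FALSE)
df <- tibble::tibble(x = 1:5)
# explicitly request min/max/null-count statistics for all columns
write_parquet(df, "/tmp/stats.parquet", write_statistics = TRUE)
```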
… file path for more flexibility
…rrow::WriterProperties to R side
…ties_Builder class skeleton
…perties to write_parquet()
Co-Authored-By: Neal Richardson <[email protected]>
…ch class. Many tests were false positives
Force-pushed from 354d263 to 50555f8
nealrichardson left a comment:
One final style question but otherwise LGTM, happy to merge today regardless of where we land on the whitespace question.
```r
                         col_select = NULL,
                         as_data_frame = TRUE,
                         props = ParquetReaderProperties$create(),
                         ...) {
```
This is "bad", according to the tidyverse style guide, which I believed we were trying to follow: https://style.tidyverse.org/functions.html#long-lines-1
I can get used to whatever style conventions we decide, just want to make sure we're in agreement.
I'll set up my RStudio to obey the style; perhaps we should run styler:: once in a while to do that automatically.

Can you update the PR description to reflect what is actually in the PR? (Writing a Parquet file into a CompressedOutputStream isn't recommended -- you would have to decompress the entire file first to be able to read any part of it.)
The ability to preserve categorical values was introduced in #5077 as the convention of storing a special `ARROW:schema` key in the metadata. To invoke this, we need to call `ArrowWriterProperties::store_schema()`. The R binding is already prepared for this, but calls `store_schema()` only conditionally and uses `parquet___default_arrow_writer_properties()` by default. Though I don't see the motivation for implementing it that way in #5451, considering [the Python binding always calls `store_schema()`](https://github.com/apache/arrow/blob/dbe708c7527a4aa6b63df7722cd57db4e0bd2dc7/python/pyarrow/_parquet.pyx#L1269), I guess the R code can do the same.

Closes #6135 from yutannihilation/ARROW-7045_preserve_factor_in_parquet and squashes the following commits:

9227e7e <Hiroaki Yutani> Fix test
4d8bb46 <Hiroaki Yutani> Remove default_arrow_writer_properties()
dfd08cb <Hiroaki Yutani> Add failing tests

Authored-by: Hiroaki Yutani <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>
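A small round-trip sketch of what storing the `ARROW:schema` key enables (assuming a build where `write_parquet()` always calls `store_schema()`, per the change above):

```r
library(arrow, warn.conflicts = FALSE)
df <- tibble::tibble(f = factor(c("a", "b", "a")))
tf <- tempfile(fileext = ".parquet")
write_parquet(df, tf)
# the factor column round-trips instead of degrading to character
class(read_parquet(tf)$f)
#> [1] "factor"
```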
This adds parameters to `write_parquet()` to control compression, whether to use dictionary encoding, etc., on top of the C++ classes `parquet::WriterProperties` and `parquet::ArrowWriterProperties`, e.g. […]
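For illustration, an invocation exercising several of the new parameters (names taken from the signature shown earlier in the thread; the file path is a placeholder):

```r
library(arrow, warn.conflicts = FALSE)
df <- tibble::tibble(x = 1:5)
write_parquet(
  df, "/tmp/test.parquet",
  version = "2.0",          # Parquet format version
  compression = "snappy",   # codec applied to all columns
  use_dictionary = TRUE,    # dictionary-encode columns
  write_statistics = TRUE   # store min/max/null-count statistics
)
```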