ARROW-16511: [R] Preserve schema metadata in write_dataset() #13105
Conversation
jonkeane left a comment
Looks good, thanks for the quick PR + a bit of cleanup along the way.
# For backwards compatibility with Scanner-based writer (arrow <= 7.0.0):
# retain metadata from source dataset
If we had one already, a JIRA would be nice here, but I'm sure we'll remember this is where it's going even without it, so let's not.
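To illustrate the behavior under discussion, here is a minimal sketch (not code from the PR itself) of the round trip this change preserves: custom schema metadata attached to a Table should survive write_dataset() and come back from open_dataset(). The key name my_key is a hypothetical example.

```r
library(arrow, warn.conflicts = FALSE)

# Attach custom metadata to a Table's schema
tab <- arrow_table(x = 1:3, y = c("a", "b", "c"))
tab$metadata <- list(my_key = "my_value")

# Write the Table out as a (single-file) dataset and reopen it
tf <- tempfile()
write_dataset(tab, tf, format = "parquet")
ds <- open_dataset(tf)

# With this fix, the custom key is still present on the schema
ds$schema$metadata$my_key
```

Before this change, write_dataset() dropped non-R schema metadata (which is why the sfarrow "geo" key was lost); after it, the metadata round-trips.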
paleolimbot left a comment
Looks great! I checked the sfarrow example that failed and it works on this branch (and fails on the master branch, as expected):
# remotes::install_github("apache/arrow/r#13105")
library(arrow, warn.conflicts = FALSE)
library(sfarrow)
# read spatial object
nc <- sf::st_read(system.file("shape/nc.shp", package="sf"), quiet = TRUE)
# create random grouping
nc$group <- sample(1:3, nrow(nc), replace = TRUE)
# use dplyr to group the dataset. %>% also allowed
nc_g <- dplyr::group_by(nc, group)
# write out to parquet datasets
tf <- tempfile() # create temporary location
# partitioning determined by dplyr 'group_vars'
write_sf_dataset(nc_g, path = tf)
#> Warning: This is an initial implementation of Parquet/Feather file support and
#> geo metadata. This is tracking version 0.1.0 of the metadata
#> (https://github.com/geopandas/geo-arrow-spec). This metadata
#> specification may change and does not yet make stability promises. We
#> do not yet recommend using this in a production setting unless you are
#> able to rewrite your Parquet/Feather files.
list.files(tf, recursive = TRUE)
#> [1] "group=1/part-0.parquet" "group=2/part-0.parquet" "group=3/part-0.parquet"
# open parquet files from dataset
ds <- arrow::open_dataset(tf)
# create a query. %>% also allowed
q <- dplyr::filter(ds, group == 1)
# read the dataset (piping syntax also works)
read_sf_dataset(dataset = q)
#> Simple feature collection with 31 features and 15 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: -83.73952 ymin: 33.88199 xmax: -75.7637 ymax: 36.55716
#> Geodetic CRS: NAD27
#> First 10 features:
#> AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74
#> 1 0.070 2.968 1831 1831 Currituck 37053 37053 27 508 1
#> 2 0.153 2.206 1832 1832 Northampton 37131 37131 66 1421 9
#> 3 0.109 1.325 1841 1841 Person 37145 37145 73 1556 4
#> 4 0.190 2.204 1846 1846 Halifax 37083 37083 42 3608 18
#> 5 0.081 1.288 1880 1880 Watauga 37189 37189 95 1323 1
#> 6 0.086 1.267 1893 1893 Yadkin 37197 37197 99 1269 1
#> 7 0.111 1.392 1904 1904 Alamance 37001 37001 1 4672 13
#> 8 0.059 1.319 1927 1927 Mitchell 37121 37121 61 671 0
#> 9 0.122 1.516 1932 1932 Caldwell 37027 37027 14 3609 6
#> 10 0.080 1.307 1936 1936 Yancey 37199 37199 100 770 0
#> NWBIR74 BIR79 SID79 NWBIR79 group geometry
#> 1 123 830 2 145 1 MULTIPOLYGON (((-76.00897 3...
#> 2 1066 1606 3 1197 1 MULTIPOLYGON (((-77.21767 3...
#> 3 613 1790 4 650 1 MULTIPOLYGON (((-78.8068 36...
#> 4 2365 4463 17 2980 1 MULTIPOLYGON (((-77.33221 3...
#> 5 17 1775 1 33 1 MULTIPOLYGON (((-81.80622 3...
#> 6 65 1568 1 76 1 MULTIPOLYGON (((-80.49554 3...
#> 7 1243 5767 11 1397 1 MULTIPOLYGON (((-79.24619 3...
#> 8 1 919 2 4 1 MULTIPOLYGON (((-82.11885 3...
#> 9 309 4249 9 360 1 MULTIPOLYGON (((-81.32813 3...
#> 10      12  869     1      10     1 MULTIPOLYGON (((-82.27921 3...

Created on 2022-05-09 by the reprex package (v2.0.1)
Closes apache#13105 from nealrichardson/write-dataset-metadata

Authored-by: Neal Richardson <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>
Benchmark runs are scheduled for baseline = 214135d and contender = d00caa9. d00caa9 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
…Hub issue numbers (#34260)

Rewrite the Jira issue numbers to the GitHub issue numbers, so that the GitHub issue numbers are automatically linked to the issues by pkgdown's auto-linking feature. Issue numbers have been rewritten based on the following correspondence. The pkgdown settings have also been changed and updated to link to GitHub. I generated the Changelog page using the `pkgdown::build_news()` function and verified that the links work correctly.

---

ARROW-6338 #5198
ARROW-6364 #5201
ARROW-6323 #5169
ARROW-6278 #5141
ARROW-6360 #5329
ARROW-6533 #5450
ARROW-6348 #5223
ARROW-6337 #5399
ARROW-10850 #9128
ARROW-10624 #9092
ARROW-10386 #8549
ARROW-6994 #23308
ARROW-12774 #10320
ARROW-12670 #10287
ARROW-16828 #13484
ARROW-14989 #13482
ARROW-16977 #13514
ARROW-13404 #10999
ARROW-16887 #13601
ARROW-15906 #13206
ARROW-15280 #13171
ARROW-16144 #13183
ARROW-16511 #13105
ARROW-16085 #13088
ARROW-16715 #13555
ARROW-16268 #13550
ARROW-16700 #13518
ARROW-16807 #13583
ARROW-16871 #13517
ARROW-16415 #13190
ARROW-14821 #12154
ARROW-16439 #13174
ARROW-16394 #13118
ARROW-16516 #13163
ARROW-16395 #13627
ARROW-14848 #12589
ARROW-16407 #13196
ARROW-16653 #13506
ARROW-14575 #13160
ARROW-15271 #13170
ARROW-16703 #13650
ARROW-16444 #13397
ARROW-15016 #13541
ARROW-16776 #13563
ARROW-15622 #13090
ARROW-18131 #14484
ARROW-18305 #14581
ARROW-18285 #14615

* Closes: #33631

Authored-by: SHIMA Tatsuya <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
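The renumbering described above amounts to a literal text substitution driven by a Jira-to-GitHub lookup table. A minimal sketch (this is not the actual script used for #34260; the helper name and the two-entry table are illustrative) might look like:

```r
# Hypothetical subset of the Jira -> GitHub correspondence table
jira_to_gh <- c(
  "ARROW-16511" = "#13105",
  "ARROW-16085" = "#13088"
)

# Replace each Jira key with its GitHub issue number.
# fixed = TRUE treats the key as a literal string, not a regex,
# so the "-" in "ARROW-16511" is matched literally.
rewrite_issue_numbers <- function(text, mapping) {
  for (jira in names(mapping)) {
    text <- gsub(jira, mapping[[jira]], text, fixed = TRUE)
  }
  text
}

rewrite_issue_numbers("ARROW-16511: [R] Preserve schema metadata", jira_to_gh)
```

A real rewrite would also want to guard against one key being a prefix of another (e.g. by substituting longer keys first), which a plain substring replacement does not handle.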