Description
To support things like SF columns, we added code that handles row-level metadata (#8549 and #9182).
This works fine for a single table or a single parquet file, but with a dataset (even without filtering!) it can produce surprising (and wrong) results (see the reprex below).
Work is already underway in ARROW-12542 to make it easier to convert row-element-level attributes to a struct and store them in the column, but that is still a bit off. Even once that is done, should we disable this entirely? Should we stop, or ignore and warn, because row-level metadata can't be applied correctly with datasets (there's no way for us to get the ordering right)? Something else?
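As a rough illustration of why ordering matters here, the following base-R sketch (no arrow involved) simulates the failure mode. It assumes row-level metadata is kept as a list indexed only by original row position, and it stands in for the partitioned write and read with a plain `order()` on the partition key. The names `row_meta`, `df_read`, and `letter_attr` are hypothetical and are not part of the arrow package.

```r
# Toy data mirroring the reprex: 26 rows with alternating partition keys.
df <- data.frame(part = rep(1:2, 13), let = letters, stringsAsFactors = FALSE)

# Hypothetical stand-in for row-level metadata: one attribute list per row,
# stored outside the data and indexed only by original row position.
row_meta <- lapply(df$let, function(l) list(letter = l))

# A partitioned write groups rows by partition, so reading back returns them
# in a different order. This is simulated with a stable sort on `part`.
df_read <- df[order(df$part), ]
rownames(df_read) <- NULL

# Reapplying metadata by position pairs row i of the reordered data with the
# metadata of original row i.
positional <- vapply(
  seq_len(nrow(df_read)),
  function(i) row_meta[[i]]$letter,
  character(1)
)

# Count the rows whose data and metadata no longer agree.
mismatched <- sum(positional != df_read$let)
mismatched
#> [1] 24

# The row holding "b" picks up the metadata that belonged to "n".
positional[df_read$let == "b"]
#> [1] "n"

# If the attribute instead travels as a regular column, reordering is harmless.
df_col <- df
df_col$letter_attr <- vapply(row_meta, function(m) m$letter, character(1))
df_col_read <- df_col[order(df_col$part), ]
all(df_col_read$letter_attr == df_col_read$let)
#> [1] TRUE
```

The last step is only meant to show the general idea behind storing the attributes in the column itself; it does not reflect how ARROW-12542 will implement it.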
library(arrow)
df <- tibble::tibble(
  part = rep(1:2, 13),
  let = letters
)
df$embedded_attr <- lapply(seq_len(nrow(df)), function(i) {
  value <- "nothing"
  attributes(value) <- list(letter = df[[i, "let"]])
  value
})
df_from_tab <- as.data.frame(Table$create(df))
# this should be (and is) "b"
attributes(df_from_tab[df_from_tab$let == "b", "embedded_attr"][[1]][[1]])
#> $letter
#> [1] "b"
# the dfs are the same
waldo::compare(df, df_from_tab)
#> ✓ No differences
# now via dataset
dir <- "ds-dir"
write_dataset(df, path = dir, partitioning = "part")
ds <- open_dataset(dir)
df_from_ds <- dplyr::collect(ds)
# this should be (and is not) "b"
attributes(df_from_ds[df_from_ds$let == "b", "embedded_attr"][[1]][[1]])
#> $letter
#> [1] "n"
# Even controlling for order, the dfs are not the same
waldo::compare(dplyr::arrange(df, let), dplyr::arrange(df_from_ds, let))
#> `names(old)`: "part" "let" "embedded_attr"
#> `names(new)`: "let" "embedded_attr" "part"
#>
#> `attr(old$embedded_attr[[2]], 'letter')`: "b"
#> `attr(new$embedded_attr[[2]], 'letter')`: "n"
#>
#> `attr(old$embedded_attr[[3]], 'letter')`: "c"
#> `attr(new$embedded_attr[[3]], 'letter')`: "b"
#>
#> `attr(old$embedded_attr[[4]], 'letter')`: "d"
#> `attr(new$embedded_attr[[4]], 'letter')`: "o"
#>
#> `attr(old$embedded_attr[[5]], 'letter')`: "e"
#> `attr(new$embedded_attr[[5]], 'letter')`: "c"
#>
#> `attr(old$embedded_attr[[6]], 'letter')`: "f"
#> `attr(new$embedded_attr[[6]], 'letter')`: "p"
#>
#> `attr(old$embedded_attr[[7]], 'letter')`: "g"
#> `attr(new$embedded_attr[[7]], 'letter')`: "d"
#>
#> `attr(old$embedded_attr[[8]], 'letter')`: "h"
#> `attr(new$embedded_attr[[8]], 'letter')`: "q"
#>
#> `attr(old$embedded_attr[[9]], 'letter')`: "i"
#> `attr(new$embedded_attr[[9]], 'letter')`: "e"
#>
#> `attr(old$embedded_attr[[10]], 'letter')`: "j"
#> `attr(new$embedded_attr[[10]], 'letter')`: "r"
#>
#> And 15 more differences ...
Reporter: Jonathan Keane / @jonkeane
Assignee: Jonathan Keane / @jonkeane
Related issues:
- [R] SF columns in datasets with filters (is duplicated by)
PRs and other links:
Note: This issue was originally created as ARROW-13189. Please see the migration documentation for further details.