[R] Disable row-level metadata application on datasets #28879

@asfimport

To support things like sf columns, we added code that handles row-level metadata (#8549 and #9182).

This works fine for a single table or a single Parquet file, but with a dataset (even without filtering!) it can produce surprising (and wrong) results (see the reprex below).

Work is already underway in ARROW-12542 to make it easier to convert row-level (per-element) attributes to a struct and store it in the column, but that is still some way off. Even once that is done, should we disable this entirely? Should we stop, or ignore and warn, because row-level metadata can't be applied to datasets (there is no way for us to get the ordering right)? Something else?
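To make the "ignore + warn" option concrete, here is a minimal sketch in base R. It is not arrow's implementation: `apply_row_attrs()`, its arguments, and the `from_dataset` flag are hypothetical names used only for illustration, and the per-row attributes are assumed to arrive as a plain list with one element per row.

```r
# Hypothetical helper, NOT part of the arrow package.
# `row_attrs`: list with one element per row, each a named list of attributes.
# `from_dataset`: TRUE when `df` was collected from a multi-file dataset.
apply_row_attrs <- function(df, column, row_attrs, from_dataset = FALSE) {
  if (is.null(row_attrs)) {
    return(df)
  }
  if (from_dataset) {
    # Row order across dataset files is not guaranteed, so positional
    # attributes cannot be matched to rows reliably: skip them and warn.
    warning(
      "Row-level metadata is not applied to datasets; ignoring it for column '",
      column, "'.",
      call. = FALSE
    )
    return(df)
  }
  stopifnot(length(row_attrs) == nrow(df))
  # Single table or single file: positions line up, so reattach by index.
  df[[column]] <- lapply(seq_len(nrow(df)), function(i) {
    value <- df[[column]][[i]]
    attributes(value) <- row_attrs[[i]]
    value
  })
  df
}
```

Switching `warning()` to `stop()` would give the stricter "stop" behavior instead.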

library(arrow)

df <- tibble::tibble(
  part = rep(1:2, 13),
  let = letters
)

df$embedded_attr <- lapply(seq_len(nrow(df)), function(i) {
  value <- "nothing"
  attributes(value) <- list(letter = df[[i, "let"]])
  value
})
df_from_tab <- as.data.frame(Table$create(df))

# this should be (and is) "b"
attributes(df_from_tab[df_from_tab$let == "b", "embedded_attr"][[1]][[1]])
#> $letter
#> [1] "b"

# the dfs are the same
waldo::compare(df, df_from_tab)
#> ✓ No differences

# now via dataset
dir <- "ds-dir"
write_dataset(df, path = dir, partitioning = "part")

ds <- open_dataset(dir)
df_from_ds <- dplyr::collect(ds)

# this should be (and is not) "b"
attributes(df_from_ds[df_from_ds$let == "b", "embedded_attr"][[1]][[1]])
#> $letter
#> [1] "n"

# Even controlling for order, the dfs are not the same
waldo::compare(dplyr::arrange(df, let), dplyr::arrange(df_from_ds, let))
#> `names(old)`: "part" "let" "embedded_attr"       
#> `names(new)`:        "let" "embedded_attr" "part"
#> 
#> `attr(old$embedded_attr[[2]], 'letter')`: "b"
#> `attr(new$embedded_attr[[2]], 'letter')`: "n"
#> 
#> `attr(old$embedded_attr[[3]], 'letter')`: "c"
#> `attr(new$embedded_attr[[3]], 'letter')`: "b"
#> 
#> `attr(old$embedded_attr[[4]], 'letter')`: "d"
#> `attr(new$embedded_attr[[4]], 'letter')`: "o"
#> 
#> `attr(old$embedded_attr[[5]], 'letter')`: "e"
#> `attr(new$embedded_attr[[5]], 'letter')`: "c"
#> 
#> `attr(old$embedded_attr[[6]], 'letter')`: "f"
#> `attr(new$embedded_attr[[6]], 'letter')`: "p"
#> 
#> `attr(old$embedded_attr[[7]], 'letter')`: "g"
#> `attr(new$embedded_attr[[7]], 'letter')`: "d"
#> 
#> `attr(old$embedded_attr[[8]], 'letter')`: "h"
#> `attr(new$embedded_attr[[8]], 'letter')`: "q"
#> 
#> `attr(old$embedded_attr[[9]], 'letter')`: "i"
#> `attr(new$embedded_attr[[9]], 'letter')`: "e"
#> 
#> `attr(old$embedded_attr[[10]], 'letter')`: "j"
#> `attr(new$embedded_attr[[10]], 'letter')`: "r"
#> 
#> And 15 more differences ...
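Until something like the struct-based storage exists, one user-side way to sidestep the ordering problem is to move the per-row attribute into an ordinary column before writing, so it travels with its row, and reattach it after reading. The sketch below is an assumption-laden illustration in base R: `flatten_attr()` and `restore_attr()` are hypothetical helpers (not arrow functions), the attribute is assumed to be a single string, and the row reordering is simulated by reindexing rather than by a real dataset round trip.

```r
# Hypothetical helpers, NOT part of arrow.
# Copy a per-row attribute into a regular column and strip it from the values.
flatten_attr <- function(df, column, attr_name) {
  flat_name <- paste0(column, "_", attr_name)
  df[[flat_name]] <- vapply(
    df[[column]],
    function(value) attr(value, attr_name),
    character(1)
  )
  df[[column]] <- lapply(df[[column]], function(value) {
    attributes(value) <- NULL
    value
  })
  df
}

# Reattach the attribute from the helper column, then drop that column.
restore_attr <- function(df, column, attr_name) {
  flat_name <- paste0(column, "_", attr_name)
  df[[column]] <- lapply(seq_len(nrow(df)), function(i) {
    value <- df[[column]][[i]]
    attr(value, attr_name) <- df[[flat_name]][[i]]
    value
  })
  df[[flat_name]] <- NULL
  df
}

df <- data.frame(let = letters[1:4], stringsAsFactors = FALSE)
df$embedded_attr <- lapply(df$let, function(l) structure("nothing", letter = l))

flat <- flatten_attr(df, "embedded_attr", "letter")
# Stand-in for a dataset returning rows in a different order.
shuffled <- flat[c(3, 1, 4, 2), ]
restored <- restore_attr(shuffled, "embedded_attr", "letter")

# The attribute still matches its row, whatever the row order.
attr(restored$embedded_attr[[which(restored$let == "b")]], "letter")
```

Because the attribute is stored as data rather than as positional metadata, it no longer depends on the order in which the rows come back.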

Reporter: Jonathan Keane / @jonkeane
Assignee: Jonathan Keane / @jonkeane

Note: This issue was originally created as ARROW-13189. Please see the migration documentation for further details.
