Conversation

@thisisnic (Member) commented Mar 31, 2023

This PR updates the table creator to extract metadata only when a single data.frame (and nothing else) has been passed in via the dots. Previously, the metadata was extracted from the first item passed in if it was a data.frame, resulting in inconsistent behaviour depending on argument order.

It also ensures that the object returned from as.data.frame() is always a vanilla data.frame (previously it was sometimes a tibble and sometimes a data.frame).

Before:

library(arrow)

tbl <- data.frame(x = 1)
x <- arrow_table(tbl, name = "1") 
y <- arrow_table(name = "1", tbl) 
z <- arrow_table(tbl)
x$r_metadata
#> $attributes
#> $attributes$class
#> [1] "data.frame"
#> 
#> 
#> $columns
#> $columns$x
#> NULL
#> 
#> $columns$name
#> NULL
y$r_metadata
#> NULL
z$r_metadata
#> $attributes
#> $attributes$class
#> [1] "data.frame"
#> 
#> 
#> $columns
#> $columns$x
#> NULL

class(as.data.frame(x))
#> [1] "data.frame"
class(as.data.frame(y))
#> [1] "tbl_df"     "tbl"        "data.frame"
class(as.data.frame(z))
#> [1] "data.frame"

After:

library(arrow)

tbl <- data.frame(x = 1)
x <- arrow_table(tbl, name = "1") 
y <- arrow_table(name = "1", tbl) 
z <- arrow_table(tbl)
x$r_metadata
#> NULL
y$r_metadata
#> NULL
z$r_metadata
#> $attributes
#> $attributes$class
#> [1] "data.frame"
#> 
#> 
#> $columns
#> $columns$x
#> NULL

class(as.data.frame(x))
#> [1] "data.frame"
class(as.data.frame(y))
#> [1] "data.frame"
class(as.data.frame(z))
#> [1] "data.frame"

I did look at implementing as_tibble.ArrowTabular(), but it's unnecessary, as as_tibble() by default calls as.data.frame() on any object that doesn't have an S3 method implemented for it anyway.
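
For reference, the method would have looked something like this (a hypothetical sketch only, not something added in this PR, and assuming the tibble package is available):

# Hypothetical sketch, not part of this PR: as_tibble()'s default method already
# falls back to as.data.frame() for classes without a registered method.
as_tibble.ArrowTabular <- function(x, ...) {
  tibble::as_tibble(as.data.frame(x), ...)
}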

@github-actions (bot)

⚠️ GitHub issue #34775 has been automatically assigned in GitHub to PR creator.

@thisisnic (Member, Author) commented Apr 3, 2023

OK, I'm not loving this solution as it stands. The failing tests are due to the fact that we use as.data.frame() internally in lots of functions like read_feather(), so the knock-on effect of this change is that we'd be returning plain data.frames in more circumstances than I anticipated, which would be detrimental.

We can't just swap it for as_tibble() inside these functions as we don't have the tibble package as a dependency.

I could write a new internal function for use in these circumstances which returns tibbles if the package is installed or data.frames if not, but that seems wrong too?
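
Roughly something like this (a sketch only; the name and behaviour are hypothetical):

# Sketch of a hypothetical internal helper: return a tibble when the tibble
# package is installed, otherwise fall back to a plain data.frame.
df_or_tibble <- function(x, ...) {
  df <- as.data.frame(x, ...)
  if (requireNamespace("tibble", quietly = TRUE)) {
    tibble::as_tibble(df)
  } else {
    df
  }
}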

I may have to scale this PR back to just fixing the argument-ordering bug, and leave as.data.frame() as a function which usually returns tibbles but at least does so consistently.

Would be good to get your thoughts here, @paleolimbot

@paleolimbot (Member)

We can't just swap it for as_tibble() inside these functions as we don't have the tibble package as a dependency.

I think we can safely add tibble to Suggests and use as_tibble() in tests that already expect a tibble? Unless I'm misunderstanding the failures?

I think we should definitely make this change! Type stability is important and I myself have written as.data.frame(as.data.frame()) too many times for this to be a fluke.

@thisisnic (Member, Author)

We can't just swap it for as_tibble() inside these functions as we don't have the tibble package as a dependency.

I think we can safely add tibble to Suggests and use as_tibble() in tests that already expect a tibble? Unless I'm misunderstanding the failures?

To clarify, I'm talking about the functions themselves, not the tests. We call as.data.frame() in the body of read_feather().

@paleolimbot (Member)

I see... I would be +1 on having read_xxx() and dplyr::collect() always return tibbles (and adding tibble as an Imports dependency, since not all of that requires dplyr). I think functionally that happened almost all of the time anyway (just via automagic metadata restoration instead of an explicit as_tibble() call). Having the explicit as_tibble() call is better, too, because with automagic metadata restoration the tibble namespace isn't necessarily loaded (meaning the nice print method and subset rules may or may not kick in depending on what libraries a user has loaded in their session).

@thisisnic (Member, Author)

OK, I'm in favour of taking it on as a new dependency, though it's a big enough change that I'm going to add it as a discussion topic in the dev meeting on Thursday before pushing it ahead.

@thisisnic (Member, Author) left a comment

I'll wait til we're happy with the overall PR before doing this, but we should absolutely update the docs before merging this.

@github-actions github-actions bot added the "awaiting changes" and "awaiting change review" labels and removed the "awaiting committer review" and "awaiting changes" labels on Apr 4, 2023
@eitsupi (Contributor) commented Apr 4, 2023

For people using it in combination with the data.table package (Rdatatable/data.table#2026), as_tibble may not be ideal?

@paleolimbot (Member)

As evidenced by the hundreds of test failures when we removed this default, I think our read_XX() and collect() implementations already returned tibbles (or might have returned tibbles): using as_tibble() just adds predictability to that process.

If data.table wants fread on IPC streams or Feather, nanoarrow will probably be a better long-term solution (IPC support is in the C library, although I haven't had time to do an R wrapper yet).

@eitsupi (Contributor) commented Apr 4, 2023

If I remember correctly, read_feather and read_parquet restore the R attributes as they were before writing, so if we didn't write a tibble to the Arrow or Parquet file, the result didn't become a tibble, I think.
On the other hand, I think Feather V1 always returned a tibble.

> data.table::data.table(a = 1) |> arrow::write_parquet("test.parquet")

> arrow::read_parquet("test.parquet") |> class()
[1] "data.table" "data.frame"

> data.table::data.table(a = 1) |> arrow::write_feather("test.arrow")

> arrow::read_feather("test.arrow") |> class()
[1] "data.table" "data.frame"

> data.table::data.table(a = 1) |> arrow::write_feather("test.feather", version = "1")

> arrow::read_feather("test.feather") |> class()
[1] "tbl_df"     "tbl"        "data.frame"

And since dplyr often returns tibbles, it makes sense that calling collect() would result in a tibble.

If data.table wants fread on IPC streams or Feather, nanoarrow will probably be a better long-term solution (IPC support is in the C library, although I haven't had time to do an R wrapper yet).

This is definitely great!
I just wanted to point out that there may be users of data.table.

@thisisnic (Member, Author)

Thanks folks for the discussion here, it's good to have more eyes on this.

If I remember correctly, read_feather and read_parquet restore the attributes of R as they were before they were written, so if we didn't write tibble to arrow file or parquet, it didn't become tibble, I think.

That's right, they restore attributes, but there was a little more going on here: sometimes we return a tibble and other times a data.frame, which isn't great. Fixing that was the original purpose of this PR, but in fixing it things became more complex.

@assignUser (Member)

I like the consistency of always returning a tibble, and IMO that would make sense given the overall alignment of the arrow package with dplyr.

@ianmcook (Member) commented Apr 4, 2023

I believe we should avoid adding a hard dependency on tibble if we can, and provide a way to return a plain old base R data.frame for users who prefer that.

@assignUser (Member)

Ah I thought we had it as an indirect dependency already...

@paleolimbot (Member)

I believe we should avoid adding a hard dependency on tibble if we can

We already depend on all of tibble's dependencies because we have a hard dependency on vctrs...I don't think this is an issue.

and provide a way to return a plain old base R data.frame for users who prefer that

Users that need a plain data.frame can do as.data.frame(read_whatever())? (They need to do that today anyway, because whoever wrote the data might have written a tibble.) We adopt the tidyverse paradigm almost everywhere else in an attempt to minimize the number of new things users have to learn in order to use Arrow (e.g., our function arguments intentionally mirror readr's). It's not that wanting a base R data frame is a bad thing or that users shouldn't want it; it's that it's an inconsistent design choice with respect to the rest of the package.
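
For example (an illustrative sketch only; the round trip just shows that wrapping the read in as.data.frame() strips whatever classes were restored from the metadata):

library(arrow)

# Illustrative only: write a tibble, read it back, then coerce explicitly
tf <- tempfile(fileext = ".parquet")
write_parquet(tibble::tibble(x = 1), tf)

class(read_parquet(tf))
#> [1] "tbl_df"     "tbl"        "data.frame"
class(as.data.frame(read_parquet(tf)))
#> [1] "data.frame"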

...or if returning a tibble really is a problem, we should commit to returning only data.frames, because returning a tibble only sometimes is the worst of all worlds.

@thisisnic (Member, Author)

@ianmcook Would you mind expanding on why not? I suspect you won't be the only person with this opinion, and regardless of what we end up doing, I want to better understand the objections.

There's also the possible workaround you mentioned elsewhere involving conditionally exporting the S3 generic, which we should explore further; if it works, it would allow us to return tibbles without the added dependency.

@ianmcook (Member) commented Apr 4, 2023

It's true that tibble would not be our first hard dependency on a tidyverse package, but if I'm not mistaken, it would be our first hard tidyverse package dependency that is exposed to end users. I believe all of our other hard tidyverse package dependencies are just under-the-hood stuff. I'm a big tidyverse fan myself, but as a general rule I think we should aim for the arrow package to expose no tidyverse stuff to end users unless they explicitly ask for it.

@ianmcook (Member) commented Apr 5, 2023

...or if returning a tibble really is a problem, we should commit to returning only data.frames, because returning a tibble only sometimes is the worst of all worlds.

IMO it would be a reasonable behavior to return a tibble unless (a) the tibble package is not installed, or (b) the user sets an option to specify that they want a plain old data.frame.

I think that's the best option.

The second best option is always return a tibble.

By far the worst option is always return a plain old data.frame.
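
Sketching out the first option above (the option name and helper are entirely hypothetical, not an existing arrow API):

# Hypothetical, illustration only: return a tibble unless the user opts out or
# the tibble package isn't installed.
return_table_as <- function(df) {
  wants_plain_df <- isTRUE(getOption("arrow.return_base_data_frame", FALSE))
  if (!wants_plain_df && requireNamespace("tibble", quietly = TRUE)) {
    tibble::as_tibble(df)
  } else {
    df
  }
}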

@thisisnic thisisnic force-pushed the GH-34775_metadata branch from 1f81490 to d6020ad on April 5, 2023 at 12:27
@paleolimbot (Member)

IMO it would be a reasonable behavior to return a tibble unless (a) the tibble package is not installed, or (b) the user sets an option to specify that they want a plain old data.frame.

I am against this solution because it does not address the fact that read_whatever() might return either a data.frame or a tibble, and because it adds unnecessary complexity to our code base.

The underlying problem is that this package is trying to be both user-facing (where tibbles are helpful since they print nicely in an interactive session) and developer-facing (where tibbles are unnecessary). I think this package should commit to being user-facing and let other packages like rpolars, nanoarrow, and adbcdrivermanager provide interfaces that cater to developers/low-level users.

@eitsupi (Contributor) commented Apr 5, 2023

I am wondering whether such a change would eliminate the need to write (some) R attributes to the files? Is there a benefit to continuing to write these to files?

@thisisnic (Member, Author)

Given that nobody asked us to implement as_tibble() and this is contentious, I'm tempted to reduce the scope of this PR to: guaranteeing that as.data.frame() returns a base R data.frame, picking a new (internal) function to use in the body of the read_*() functions instead of as.data.frame(), and not implementing as_tibble() here. We'd still be left with the problem of variably returning data.frames vs. tibbles from the read_*() functions, but we can defer that to another discussion/ticket.

@thisisnic (Member, Author)

(FWIW I tried the conditional export approach I mentioned above but couldn't get it working).

@nealrichardson (Member)

Apologies for being late to the discussion here, and forgive me if I'm missing something, but it seems like the odd behavior @thisisnic reported is a narrower problem with how the auto-splicing works in arrow_table() / Table$create(). That was added in #4704, and there was a TODO issue made about improving the C++ code around it (#22186), though I don't think this particular issue was what they had in mind. This seems worth fixing.

But I'm not sure there's a more general problem to solve; or rather, I'm not sure there's a better solution than the one we have that meets our requirements. It seems like the constraints we're solving for, ranked by priority, are:

  1. as.data.frame(ArrowObject) should return an object where is.data.frame() is TRUE and can be used by any other R functions. No Arrow methods required to work with it.
  2. Fidelity: when a user saves an R data object to Arrow or Parquet, they should get the same R object back when they load it, including metadata.
  3. Internal coherence: the same kind of operation in various places should yield the same kind of result.
  4. tbls are useful because they print nicer than vanilla data.frames.
  5. Because we want Arrow to have broad adoption and appeal, we want to avoid adding hard dependencies.

Our current approach is: when converting a Table to a data.frame, we set the class to c("tbl_df", "tbl", "data.frame") instead of just "data.frame". But if there is R metadata saved that overwrites the class, we use that, so that we can faithfully round-trip R objects. Leaving aside the autosplice issue, I think that's all that's happening.

We're relying on how S3 dispatch works, such that if you don't have tibble loaded and you print an object classed c("tbl_df", "tbl", "data.frame"), it just prints like a data.frame, so we can both (a) return tbls and (b) not depend on tibble just fine. So we can get #4 and #5, but because of #2 we sacrifice a small amount of #3: we get the weaker inherits(x, "data.frame") requirement and not identical(class(x), "data.frame"). I'm of the opinion that this is a reasonable tradeoff since tibbles are just subclasses of data.frame--you always get back a data.frame. We could push back the other way, but I think we'd have to give up some of #4 as a result.
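
To illustrate the dispatch point (a quick sketch; it assumes a session where the tibble namespace has not been loaded):

# Without tibble loaded there is no print.tbl_df() method registered, so S3
# dispatch falls through to print.data.frame() and the extra classes are inert.
df <- data.frame(x = 1:3)
class(df) <- c("tbl_df", "tbl", "data.frame")
print(df)
#>   x
#> 1 1
#> 2 2
#> 3 3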

To be clear, I think you/we could justify any number of tradeoffs along these dimensions, I just wanted to sketch out what I think they are.

A few other specific thoughts:

  • I'm not sure you can make as.data.frame() always return class == "data.frame" without either sacrificing roundtrip fidelity or inventing a new function that means "take this Arrow Table with metadata and make an R object". The latter doesn't sound like something we want to do.
  • The read_xxx() functions can't really be strict about class without losing roundtrip fidelity. For me, that's enough to say inherits(x, "data.frame") is sufficient and we don't have to always return identical class.
  • Seems like dplyr::collect() should always return tibble, regardless of other changes you make (I believe it already does, given how the ExecPlan evaluation works).

Additional historical context, in case it's useful in understanding how we got here:

@daattali (Contributor) commented Apr 10, 2023

As the user who reported the bug this PR is solving, I'd like to very succinctly add my opinion.

I personally strongly believe that as.data.frame() should return a data.frame. Plain and simple. That's just what would be intuitively expected. If I cast any object with as.X(), then I expect the result to be exactly X, not a clever derivation/subclass of X. If I want a tibble, then I'd use as.tibble().

Having as.data.frame(x) and as.data.frame(as.data.frame(x)) return different types feels very wrong.

(I won't contribute any more to the discussion and let you sort it out as you see fit, unless explicitly tagged)

@thisisnic (Member, Author)

Thanks for summarising the extra context there that we didn't have before, @nealrichardson, that's super helpful; now I can see why things are the way they currently are.

FWIW, I agree with your points Dean, but I don't see a reasonable solution to that problem which doesn't cause other issues given the different priorities we're balancing.

I'm still on the fence regarding what is the best solution here, but given that the issue regarding the as.data.frame return type is 1) fairly complex, 2) contentious, 3) not the entire issue here, and 4) not something that has been reported by anyone else (yet), I'm going to close this PR and submit a new one for the narrower problem of variable results based on argument ordering.

Open to more discussion here regarding what our priorities should be, if not the current ones in the order they are above.

For example, I don't fully see the importance of roundtrip fidelity; if the input and output have the same (non-default) metadata and contents, then is there any harm caused by returning a tibble instead of a data.frame, if a user then just has the additional step of calling as.data.frame() to get a vanilla data.frame? Haven't sketched that out in practice though.

@paleolimbot (Member)

I can also attest to having to write as.data.frame(as.data.frame(arrow_stuff)) and having it be really annoying. This usually comes up in tests where expect_identical() complains, or in benchmarks where bench::mark() complains about non-equality. If it were okay for as.data.frame() to return a tibble, then as.data.frame() on a tibble would return a tibble (it does not).
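
A sketch of the kind of test that forces the double conversion (using the pre-fix behaviour, where a Table built from plain vectors converts to a tibble):

library(arrow)
library(testthat)

tab <- arrow_table(x = 1:3)    # no data.frame metadata attached
expected <- data.frame(x = 1:3)

# Fails when as.data.frame(tab) comes back with the tbl_df class:
# expect_identical(as.data.frame(tab), expected)

# The workaround is to coerce twice so both sides are plain data.frames:
expect_identical(as.data.frame(as.data.frame(tab)), expected)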

I also do not see the point of lossless roundtrip to/from a file by default. (The option should certainly exist to the extent we have the capacity to support it).

We are as a package in a place where we need to move towards simplicity to reflect the fact that we have very, very, very few contributors. I do not think that having the end result of this PR be "it was too hard so we didn't fix it" is a sustainable path.

@thisisnic (Member, Author) commented Apr 11, 2023

having the end result of this PR be "it was too hard so we didn't fix it" is a sustainable path.

That's not what's being said or done here. I'm separating the different issues out into smaller components: finishing the small, easily solvable one, and pausing work on this one to wait for more discussion. As you said, we don't have many contributors, and I feel we're starting to sacrifice momentum and speed of resolution on what could be modular issues, and heading towards yak shaving.

Issue of variable return type due to argument order moved to #35038

@thisisnic thisisnic closed this Apr 11, 2023
@paleolimbot (Member)

Apologies for the indirection - splitting off the obvious bug is absolutely the right decision here!
