GH-34775: [R] arrow_table: as.data.frame() sometimes returns a tbl and sometimes a data.frame #34825
Conversation
|
OK, not loving this solution as I've got it so far, as the failing tests are due to the fact that we use […]. We can't just swap it for […]. I could write a new internal function for use in these circumstances which returns tibbles if the package is installed, or […]. May have to revert this PR to just fix the argument-ordering bug, and leave […]. Would be good to get your thoughts here, @paleolimbot |
I think we can safely add tibble to Suggests and use […]. I think we should definitely make this change! Type stability is important, and I myself have written […] |
To clarify, I'm talking about the functions themselves, not the tests. We call […] |
|
I see... I would be +1 on having […] |
|
OK, I'm in favour of taking it on as a new dependency, though it's a big enough change that I'm going to add it as a discussion topic in the dev meeting on Thursday before pushing it ahead. |
thisisnic left a comment
I'll wait til we're happy with the overall PR before doing this, but we should absolutely update the docs before merging this.
|
For people using it in combination with the data.table package (Rdatatable/data.table#2026), […] |
|
As evidenced by the hundreds of test failures when we removed this default, I think our […]. If data.table wants fread on IPC streams or feather, nanoarrow will probably be a better long-term solution (IPC support is in the C library, although I haven't had time to do an R wrapper yet). |
|
If I remember correctly, […]

```r
> data.table::data.table(a = 1) |> arrow::write_parquet("test.parquet")
> arrow::read_parquet("test.parquet") |> class()
[1] "data.table" "data.frame"
> data.table::data.table(a = 1) |> arrow::write_feather("test.arrow")
> arrow::read_feather("test.arrow") |> class()
[1] "data.table" "data.frame"
> data.table::data.table(a = 1) |> arrow::write_feather("test.feather", version = "1")
> arrow::read_feather("test.feather") |> class()
[1] "tbl_df" "tbl" "data.frame"
```

And since dplyr often returns a tibble, it makes sense that executing […]
This is definitely great! |
|
Thanks folks for the discussion here, it's good to have more eyes on this.
That's right, they restore properties, but there was a little more going on here; sometimes we return a tibble and other times we return a data.frame. |
|
I like the consistency of always returning a tibble, and IMO that would make sense given the overall alignment of the arrow package with dplyr. |
|
I believe we should avoid adding a hard dependency on tibble if we can, and provide a way to return a plain old base R data.frame for users who prefer that. |
|
Ah I thought we had it as an indirect dependency already... |
We already depend on all of tibble's dependencies because we have a hard dependency on vctrs...I don't think this is an issue.
Users that need a plain data.frame can do […]. Or, if returning a tibble really is a problem, we should commit to returning only data.frames, because returning a tibble only sometimes is the worst of all worlds. |
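For reference, one workaround along these lines (not necessarily the exact call elided above) is simply base::as.data.frame(), which drops the tibble classes:

```r
# Converting a tibble back to a plain data.frame is a one-liner:
# as.data.frame() strips the "tbl_df"/"tbl" classes and keeps the columns.
tib <- tibble::tibble(a = 1:3, b = letters[1:3])
df <- as.data.frame(tib)
class(df)
#> [1] "data.frame"
```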
|
@ianmcook Would you mind expanding more on why not? I don't feel like you'll be the only person with this opinion, and regardless of what we end up doing, I want to better understand the objections. There's also the possible workaround you mentioned elsewhere involving conditionally exporting the S3 generic, which we should explore further; if it works, it would allow us to return tibbles without the added dependency. |
|
It's true that tibble would not be our first hard dependency on a tidyverse package, but if I'm not mistaken, it would be our first hard tidyverse package dependency that is exposed to end users. I believe all of our other hard tidyverse package dependencies are just under-the-hood stuff. I'm a big tidyverse fan myself, but as a general rule I think we should aim for the arrow package to expose no tidyverse stuff to end users unless they explicitly ask for it. |
IMO it would be a reasonable behavior to return a tibble unless (a) the tibble package is not installed, or (b) the user sets an option to specify that they want a plain old data.frame. I think that's the best option. The second-best option is to always return a tibble. By far the worst option is to always return a plain old data.frame.
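A minimal sketch of that proposal (not arrow's actual implementation; the option name arrow.use_tibble and the wrapper convert_result() are made up for illustration):

```r
# Return a tibble unless tibble isn't installed or the user has opted
# into plain data.frames via an option.
convert_result <- function(df) {
  use_tibble <- isTRUE(getOption("arrow.use_tibble", TRUE)) &&
    requireNamespace("tibble", quietly = TRUE)
  if (use_tibble) {
    tibble::as_tibble(df)
  } else {
    as.data.frame(df)
  }
}

convert_result(data.frame(a = 1:3))   # tibble, if the tibble package is installed
options(arrow.use_tibble = FALSE)
convert_result(data.frame(a = 1:3))   # plain data.frame
```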
I am against this solution because it does not solve the fact that […]. The underlying problem is that this package is trying to be both user-facing (where tibbles are helpful, since they print nicely in an interactive session) and developer-facing (where tibbles are unnecessary). I think this package should commit to being user-facing and let other packages like rpolars, nanoarrow, and adbcdrivermanager provide interfaces that cater to developers/low-level users. |
|
I am wondering if such a case would eliminate the need to write (some) R attributes to the files? |
|
Given that nobody asked us to implement […] |
|
(FWIW I tried the conditional export approach I mentioned above but couldn't get it working). |
|
Apologies for being late to the discussion here, and forgive me if I'm missing something, but it seems like the odd behavior @thisisnic reported is a narrower problem of how the auto-splicing works in […]. But I'm not sure there's a more general problem to solve; or rather, I'm not sure there's a better solution than the one we have that meets our requirements. It seems like the constraints we're solving for, ranked by priority, are: […]
Our current approach is: when converting a Table to a data.frame, we set the class to c("tbl_df", "tbl", "data.frame"). We're relying on how S3 dispatch works, such that if you don't have tibble loaded, […]. To be clear, I think you/we could justify any number of tradeoffs along these dimensions; I just wanted to sketch out what I think they are. A few other specific thoughts: […]
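A small sketch of the dispatch behaviour described here (illustrative, not code from the arrow source): tag a plain data.frame with the tibble classes without touching the tibble package; if tibble is attached, its print/subset methods are found, and if not, dispatch falls through to the ordinary data.frame methods.

```r
# A data.frame wearing tibble classes, with no dependency on tibble itself.
df <- data.frame(a = 1:3, b = letters[1:3])
class(df) <- c("tbl_df", "tbl", "data.frame")

# With tibble loaded, print() finds the tibble method via the "tbl" class;
# otherwise no print.tbl_df/print.tbl method exists, so print.data.frame is
# used and the object behaves like a plain data.frame.
print(df)
```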
Additional historical context, in case it's useful in understanding how we got here: […] |
|
As the user who reported the bug this PR is solving, I'd like to very succinctly add my opinion. I personally strongly believe that […]. Having […]. (I won't contribute any more to the discussion and will let you sort it out as you see fit, unless explicitly tagged.) |
|
Thanks for summarising the extra context there that we didn't have before, @nealrichardson; that's super helpful, and now I can see why things are the way they currently are. FWIW, I agree with your points, Dean, but I don't see a reasonable solution to that problem which doesn't cause other issues, given the different priorities we're balancing. I'm still on the fence regarding what is the best solution here, but given that the issue regarding the […]. Open to more discussion here regarding what our priorities should be, if not the current ones in the order they are above. For example, I don't fully see the importance of roundtrip fidelity; if the input and output have the same (non-default) metadata and contents, then is there any harm caused by returning a tibble instead of a data.frame, if a user then just has the additional step of calling […]? |
|
I can also attest to having to […]. I also do not see the point of a lossless roundtrip to/from a file by default (the option should certainly exist to the extent we have the capacity to support it). We are, as a package, in a place where we need to move towards simplicity, to reflect the fact that we have very, very, very few contributors. I do not think that having the end result of this PR be "it was too hard so we didn't fix it" is a sustainable path.
That's not what's being said or done here. I'm separating out the different issues into smaller components: I'm finishing the small, easily solvable one, and pausing work on this one to wait for more discussion. As you said, we don't have many contributors, and I feel that we're starting to sacrifice momentum and speed of resolution on what could be modular issues in favour of yak shaving. The issue of variable return type due to argument order has been moved to #35038 |
|
Apologies for the indirection - splitting off the obvious bug is absolutely the right decision here! |
This PR currently updates the table creator to only extract metadata if a single data.frame, and nothing else, has been passed in via the dots; previously the metadata was extracted from the first item passed in if it was a data.frame, resulting in inconsistent behaviour depending on argument order. It also ensures that the object returned from as.data.frame() is always a vanilla data.frame (previously it was sometimes a tibble and sometimes a data.frame).

After: […]

I did look at implementing as_tibble.ArrowTabular(), but it's unnecessary, as as_tibble() by default will call as.data.frame() on any object which doesn't have an S3 method implemented for it anyway.
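A quick way to sanity-check that fallback (the demo_tabular class below is invented for the example; it stands in for any class that has an as.data.frame() method but no as_tibble() method):

```r
library(tibble)

# A made-up class with an as.data.frame() method but no as_tibble() method.
as.data.frame.demo_tabular <- function(x, ...) data.frame(a = 1:3)
x <- structure(list(), class = "demo_tabular")

# as_tibble() finds no demo_tabular method, so its default method coerces
# the object via as.data.frame() and then builds a tibble from the result.
as_tibble(x)
#> # A tibble: 3 × 1
```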