Conversation

@thisisnic (Member) commented Mar 31, 2023

This PR updates the table creator to extract metadata only when a single data.frame (and nothing else) has been passed in via the dots. Previously, the metadata was extracted from the first item passed in if it was a data.frame, resulting in inconsistent behaviour depending on argument order.

It also ensures that the object returned from as.data.frame() is always a vanilla data.frame (previously it was sometimes a tibble and sometimes a data.frame).

Before:

library(arrow)

tbl <- data.frame(x = 1)
x <- arrow_table(tbl, name = "1") 
y <- arrow_table(name = "1", tbl) 
z <- arrow_table(tbl)
x$r_metadata
#> $attributes
#> $attributes$class
#> [1] "data.frame"
#> 
#> 
#> $columns
#> $columns$x
#> NULL
#> 
#> $columns$name
#> NULL
y$r_metadata
#> NULL
z$r_metadata
#> $attributes
#> $attributes$class
#> [1] "data.frame"
#> 
#> 
#> $columns
#> $columns$x
#> NULL

class(as.data.frame(x))
#> [1] "data.frame"
class(as.data.frame(y))
#> [1] "tbl_df"     "tbl"        "data.frame"
class(as.data.frame(z))
#> [1] "data.frame"

After:

library(arrow)

tbl <- data.frame(x = 1)
x <- arrow_table(tbl, name = "1") 
y <- arrow_table(name = "1", tbl) 
z <- arrow_table(tbl)
x$r_metadata
#> NULL
y$r_metadata
#> NULL
z$r_metadata
#> $attributes
#> $attributes$class
#> [1] "data.frame"
#> 
#> 
#> $columns
#> $columns$x
#> NULL

class(as.data.frame(x))
#> [1] "data.frame"
class(as.data.frame(y))
#> [1] "data.frame"
class(as.data.frame(z))
#> [1] "data.frame"

I did look at implementing as_tibble.ArrowTabular(), but it's unnecessary, as as_tibble() by default calls as.data.frame() on any object that doesn't have an S3 method implemented for it anyway.
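
For reference, the method would have looked something like this (a hypothetical sketch only, not something added in this PR, and assuming the tibble package is available):

# Hypothetical sketch, not part of this PR: as_tibble()'s default method already
# falls back to as.data.frame() for classes without a registered method.
as_tibble.ArrowTabular <- function(x, ...) {
  tibble::as_tibble(as.data.frame(x), ...)
}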

@github-actions (bot)

⚠️ GitHub issue #34775 has been automatically assigned in GitHub to PR creator.

@thisisnic (Member, Author) commented Apr 3, 2023

OK, I'm not loving this solution as it stands. The failing tests are due to the fact that we use as.data.frame() internally in lots of functions like read_feather(), so the knock-on effect of this change is that we'd be returning plain data.frames in more circumstances than I anticipated, which would be detrimental.

We can't just swap it for as_tibble() inside these functions as we don't have the tibble package as a dependency.

I could write a new internal function for use in these circumstances which returns tibbles if the package is installed or data.frames if not, but that seems wrong too?
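
Roughly something like this (a sketch only; the name and behaviour are hypothetical):

# Sketch of a hypothetical internal helper: return a tibble when the tibble
# package is installed, otherwise fall back to a plain data.frame.
df_or_tibble <- function(x, ...) {
  df <- as.data.frame(x, ...)
  if (requireNamespace("tibble", quietly = TRUE)) {
    tibble::as_tibble(df)
  } else {
    df
  }
}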

I may have to scale this PR back to just fixing the argument-ordering bug, and leave as.data.frame() as a function which usually returns tibbles but at least does so consistently.

Would be good to get your thoughts here, @paleolimbot

@paleolimbot (Member)

We can't just swap it for as_tibble() inside these functions as we don't have the tibble package as a dependency.

I think we can safely add tibble to Suggests and use as_tibble() in tests that already expect a tibble? Unless I'm misunderstanding the failures?

I think we should definitely make this change! Type stability is important and I myself have written as.data.frame(as.data.frame()) too many times for this to be a fluke.

@thisisnic (Member, Author)

We can't just swap it for as_tibble() inside these functions as we don't have the tibble package as a dependency.

I think we can safely add tibble to Suggests and use as_tibble() in tests that already expect a tibble? Unless I'm misunderstanding the failures?

To clarify, I'm talking about the functions themselves, not the tests. We call as.data.frame() in the body of read_feather().

@paleolimbot (Member)

I see... I would be +1 on having read_xxx() and dplyr::collect() always return tibbles (and adding tibble as an Imports dependency, since not all of that requires dplyr). I think functionally that happened almost all of the time anyway (just via automagic metadata restoration instead of an explicit as_tibble() call). Having the explicit as_tibble() call is better, too, because with automagic metadata restoration the tibble namespace isn't necessarily loaded (meaning the nice print method and subset rules may or may not kick in depending on what libraries a user has loaded in their session).

@thisisnic (Member, Author)

OK, I'm in favour of taking it on as a new dependency, though it's a big enough change that I'm going to add it as a discussion topic in the dev meeting on Thursday before pushing it ahead.

@thisisnic (Member, Author) left a comment

I'll wait til we're happy with the overall PR before doing this, but we should absolutely update the docs before merging this.

@github-actions github-actions bot added the "awaiting changes" and "awaiting change review" labels and removed the "awaiting committer review" and "awaiting changes" labels on Apr 4, 2023
@eitsupi (Contributor) commented Apr 4, 2023

For people using it in combination with the data.table package (Rdatatable/data.table#2026), as_tibble may not be ideal?

@paleolimbot (Member)

As evidenced by the hundreds of test failures when we removed this default, I think our read_XX() and collect() implementations already returned tibbles (or might have returned tibbles): using as_tibble() just adds predictability to that process.

If data.table wants fread on IPC streams or Feather, nanoarrow will probably be a better long-term solution (IPC support is in the C library, although I haven't had time to do an R wrapper yet).

@eitsupi (Contributor) commented Apr 4, 2023

If I remember correctly, read_feather and read_parquet restore the R attributes as they were before writing, so if we didn't write a tibble to the Arrow or Parquet file, the result didn't become a tibble, I think.
On the other hand, I think Feather V1 always returned a tibble.

> data.table::data.table(a = 1) |> arrow::write_parquet("test.parquet")

> arrow::read_parquet("test.parquet") |> class()
[1] "data.table" "data.frame"

> data.table::data.table(a = 1) |> arrow::write_feather("test.arrow")

> arrow::read_feather("test.arrow") |> class()
[1] "data.table" "data.frame"

> data.table::data.table(a = 1) |> arrow::write_feather("test.feather", version = "1")

> arrow::read_feather("test.feather") |> class()
[1] "tbl_df"     "tbl"        "data.frame"

And since dplyr often returns tibbles, it makes sense that calling collect() would result in a tibble.

If data.table wants fread on IPC streams or Feather, nanoarrow will probably be a better long-term solution (IPC support is in the C library, although I haven't had time to do an R wrapper yet).

This is definitely great!
I just wanted to point out that there may be users of data.table.

@thisisnic (Member, Author)

Thanks folks for the discussion here, it's good to have more eyes on this.

If I remember correctly, read_feather and read_parquet restore the attributes of R as they were before they were written, so if we didn't write tibble to arrow file or parquet, it didn't become tibble, I think.

That's right, they restore attributes, but there was a little more going on here: sometimes we return a tibble and other times a data.frame, which isn't great. Fixing that was the original purpose of this PR, but in fixing it things became more complex.

@assignUser (Member)

I like the consistency of always returning a tibble, and IMO that would make sense given the overall alignment of the arrow package with dplyr.

@ianmcook (Member) commented Apr 4, 2023

I believe we should avoid adding a hard dependency on tibble if we can, and provide a way to return a plain old base R data.frame for users who prefer that.

@assignUser (Member)

Ah I thought we had it as an indirect dependency already...

@paleolimbot (Member)

I believe we should avoid adding a hard dependency on tibble if we can

We already depend on all of tibble's dependencies because we have a hard dependency on vctrs...I don't think this is an issue.

and provide a way to return a plain old base R data.frame for users who prefer that

Users that need a plain data.frame can do as.data.frame(read_whatever())? (They need to do that today anyway, because whoever wrote the data might have written a tibble.) We adopt the tidyverse paradigm almost everywhere else in an attempt to minimize the number of new things users have to learn in order to use Arrow (e.g., our function arguments intentionally mirror readr's). It's not that wanting a base R data frame is a bad thing or that users shouldn't want it; it's that it's an inconsistent design choice with respect to the rest of the package.
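
For example (an illustrative sketch only; the round trip just shows that wrapping the read in as.data.frame() strips whatever classes were restored from the metadata):

library(arrow)

# Illustrative only: write a tibble, read it back, then coerce explicitly
tf <- tempfile(fileext = ".parquet")
write_parquet(tibble::tibble(x = 1), tf)

class(read_parquet(tf))
#> [1] "tbl_df"     "tbl"        "data.frame"
class(as.data.frame(read_parquet(tf)))
#> [1] "data.frame"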

...or if returning a tibble really is a problem, we should commit to returning only data.frames, because returning a tibble only sometimes is the worst of all worlds.

@thisisnic (Member, Author)

@ianmcook Would you mind expanding on why not? I suspect you won't be the only person with this opinion, and regardless of what we end up doing, I want to better understand the objections.

There's also the possible workaround you mentioned elsewhere involving conditionally exporting the S3 generic, which we should explore further; if it works, it would allow us to return tibbles without the added dependency.

@ianmcook (Member) commented Apr 4, 2023

It's true that tibble would not be our first hard dependency on a tidyverse package, but if I'm not mistaken, it would be our first hard tidyverse package dependency that is exposed to end users. I believe all of our other hard tidyverse package dependencies are just under-the-hood stuff. I'm a big tidyverse fan myself, but as a general rule I think we should aim for the arrow package to expose no tidyverse stuff to end users unless they explicitly ask for it.

@ianmcook (Member) commented Apr 5, 2023

...or if returning a tibble really is a problem, we should commit to returning only data.frames, because returning a tibble only sometimes is the worst of all worlds.

IMO it would be a reasonable behavior to return a tibble unless (a) the tibble package is not installed, or (b) the user sets an option to specify that they want a plain old data.frame.

I think that's the best option.

The second best option is always return a tibble.

By far the worst option is always return a plain old data.frame.
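
Sketching out the first option above (the option name and helper are entirely hypothetical, not an existing arrow API):

# Hypothetical, illustration only: return a tibble unless the user opts out or
# the tibble package isn't installed.
return_table_as <- function(df) {
  wants_plain_df <- isTRUE(getOption("arrow.return_base_data_frame", FALSE))
  if (!wants_plain_df && requireNamespace("tibble", quietly = TRUE)) {
    tibble::as_tibble(df)
  } else {
    df
  }
}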

@thisisnic thisisnic force-pushed the GH-34775_metadata branch from 1f81490 to d6020ad on April 5, 2023 at 12:27
@paleolimbot (Member)

IMO it would be a reasonable behavior to return a tibble unless (a) the tibble package is not installed, or (b) the user sets an option to specify that they want a plain old data.frame.

I am against this solution because it does not address the fact that read_whatever() might return either a data.frame or a tibble, and because it adds unnecessary complexity to our code base.

The underlying problem is that this package is trying to be both user-facing (where tibbles are helpful since they print nicely in an interactive session) and developer-facing (where tibbles are unnecessary). I think this package should commit to being user-facing and let other packages like rpolars, nanoarrow, and adbcdrivermanager provide interfaces that cater to developers/low-level users.

@eitsupi (Contributor) commented Apr 5, 2023

I am wondering whether such a change would eliminate the need to write (some) R attributes to the files? Is there a benefit to continuing to write these to files?

@thisisnic (Member, Author)

Given that nobody asked us to implement as_tibble() and this is contentious, I'm tempted to reduce the scope of this PR to: guaranteeing that as.data.frame() returns a base R data.frame, picking a new (internal) function to use in the body of the read_*() functions instead of as.data.frame(), and not implementing as_tibble() here. We'd still be left with the problem of variably returning data.frames vs. tibbles from the read_*() functions, but we can defer that to another discussion/ticket.

@thisisnic (Member, Author)

(FWIW I tried the conditional export approach I mentioned above but couldn't get it working).

@nealrichardson (Member)

Apologies for being late to the discussion here, and forgive me if I'm missing something, but it seems like the odd behavior @thisisnic reported is a narrower problem with how the auto-splicing works in arrow_table() / Table$create(). That was added in #4704, and there was a TODO issue made about improving the C++ code around it (#22186), though I don't think this particular issue was what they had in mind. This seems worth fixing.

But I'm not sure there's a more general problem to solve; or rather, I'm not sure there's a better solution than the one we have that meets our requirements. It seems like the constraints we're solving for, ranked by priority, are:

  1. as.data.frame(ArrowObject) should return an object where is.data.frame() is TRUE and can be used by any other R functions. No Arrow methods required to work with it.
  2. Fidelity: when a user saves an R data object to Arrow or Parquet, they should get the same R object back when they load it, including metadata.
  3. Internal coherence: the same kind of operation in various places should yield the same kind of result.
  4. tbls are useful because they print nicer than vanilla data.frames.
  5. Because we want Arrow to have broad adoption and appeal, we want to avoid adding hard dependencies.

Our current approach is: when converting a Table to a data.frame, we set the class to c("tbl_df", "tbl", "data.frame") instead of just "data.frame". But if there is R metadata saved that overwrites the class, we use that, so that we can faithfully round-trip R objects. Leaving aside the autosplice issue, I think that's all that's happening.

We're relying on how S3 dispatch works, such that if you don't have tibble loaded and you print an object classed c("tbl_df", "tbl", "data.frame"), it just prints like a data.frame, so we can both (a) return tbls and (b) not depend on tibble just fine. So we can get #4 and #5, but because of #2 we sacrifice a small amount of #3: we get the weaker inherits(x, "data.frame") requirement and not identical(class(x), "data.frame"). I'm of the opinion that this is a reasonable tradeoff since tibbles are just subclasses of data.frame--you always get back a data.frame. We could push back the other way, but I think we'd have to give up some of #4 as a result.
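
To illustrate the dispatch point (a quick sketch; it assumes a session where the tibble namespace has not been loaded):

# Without tibble loaded there is no print.tbl_df() method registered, so S3
# dispatch falls through to print.data.frame() and the extra classes are inert.
df <- data.frame(x = 1:3)
class(df) <- c("tbl_df", "tbl", "data.frame")
print(df)
#>   x
#> 1 1
#> 2 2
#> 3 3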

To be clear, I think you/we could justify any number of tradeoffs along these dimensions, I just wanted to sketch out what I think they are.

A few other specific thoughts:

  • I'm not sure you can make as.data.frame() always return class == "data.frame" without either sacrificing roundtrip fidelity or inventing a new function that means "take this Arrow Table with metadata and make an R object". The latter doesn't sound like something we want to do.
  • The read_xxx() functions can't really be strict about class without losing roundtrip fidelity. For me, that's enough to say inherits(x, "data.frame") is sufficient and we don't have to always return identical class.
  • Seems like dplyr::collect() should always return tibble, regardless of other changes you make (I believe it already does, given how the ExecPlan evaluation works).

Additional historical context, in case it's useful in understanding how we got here:

@daattali (Contributor) commented Apr 10, 2023

As the user who reported the bug this PR is solving, I'd like to very succinctly add my opinion.

I personally strongly believe that as.data.frame() should return a data.frame. Plain and simple. That's just what would be intuitively expected. If I cast any object with as.X(), then I expect the result to be exactly X, not a clever derivation/subclass of X. If I want a tibble, then I'd use as.tibble().

Having as.data.frame(x) and as.data.frame(as.data.frame(x)) return different types feels very wrong.

(I won't contribute any more to the discussion and let you sort it out as you see fit, unless explicitly tagged)

@thisisnic (Member, Author)

Thanks for summarising the extra context there that we didn't have before, @nealrichardson, that's super helpful; now I can see why things are the way they currently are.

FWIW, I agree with your points Dean, but I don't see a reasonable solution to that problem which doesn't cause other issues given the different priorities we're balancing.

I'm still on the fence regarding what is the best solution here, but given that the issue regarding the as.data.frame return type is 1) fairly complex, 2) contentious, 3) not the entire issue here, and 4) not something that has been reported by anyone else (yet), I'm going to close this PR and submit a new one for the narrower problem of variable results based on argument ordering.

Open to more discussion here regarding what our priorities should be, if not the current ones in the order they are above.

For example, I don't fully see the importance of roundtrip fidelity; if the input and output have the same (non-default) metadata and contents, then is there any harm caused by returning a tibble instead of a data.frame, if a user then just has the additional step of calling as.data.frame() to get a vanilla data.frame? Haven't sketched that out in practice though.

@paleolimbot (Member)

I can also attest to having to write as.data.frame(as.data.frame(arrow_stuff)) and having it be really annoying. This usually comes up in tests where expect_identical() complains, or in benchmarks where bench::mark() complains about non-equality. If it were okay for as.data.frame() to return a tibble, then as.data.frame() on a tibble would return a tibble (it does not).
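
A sketch of the kind of test that forces the double conversion (using the pre-fix behaviour, where a Table built from plain vectors converts to a tibble):

library(arrow)
library(testthat)

tab <- arrow_table(x = 1:3)    # no data.frame metadata attached
expected <- data.frame(x = 1:3)

# Fails when as.data.frame(tab) comes back with the tbl_df class:
# expect_identical(as.data.frame(tab), expected)

# The workaround is to coerce twice so both sides are plain data.frames:
expect_identical(as.data.frame(as.data.frame(tab)), expected)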

I also do not see the point of lossless roundtrip to/from a file by default. (The option should certainly exist to the extent we have the capacity to support it).

We are as a package in a place where we need to move towards simplicity to reflect the fact that we have very, very, very few contributors. I do not think that having the end result of this PR be "it was too hard so we didn't fix it" is a sustainable path.

@thisisnic (Member, Author) commented Apr 11, 2023

having the end result of this PR be "it was too hard so we didn't fix it" is a sustainable path.

That's not what's being said or done here. I'm separating the different issues out into smaller components: finishing the small, easily solvable one, and pausing work on this one to wait for more discussion. As you said, we don't have many contributors, and I feel we're starting to sacrifice momentum and speed of resolution on what could be modular issues, and heading towards yak shaving.

Issue of variable return type due to argument order moved to #35038

@thisisnic thisisnic closed this Apr 11, 2023
@paleolimbot (Member)

Apologies for the indirection - splitting off the obvious bug is absolutely the right decision here!
