ARROW-10386: [R] List column class attributes not preserved in roundtrip #9182
|
Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? Then could you also rename pull request title in the following format? See also: |
|
@github-actions crossbow submit test-r-version-compatibility |
039b59e to d18222d
|
@github-actions crossbow submit test-r-version-compatibility |
|
Revision: d18222dca3587269aa2abc27ee9033cd352351f4 Submitted crossbow builds: ursa-labs/crossbow @ actions-878
|
…st itself. ARROW-10386.
68d6c47 to 5649500
r/R/schema.R
Outdated
| #' | ||
| #' @section Metadata: | ||
| #' | ||
| #' Attributes from the `data.frame` are saved alongside tables so that the |
"When converting a data.frame to an Arrow Table or RecordBatch, "
r/R/schema.R
Outdated
| #' Modify or replace by assigning in (`sch$metadata <- new_metadata`). | ||
| #' All list elements are coerced to string. | ||
| #' | ||
| #' @section Metadata: |
This section should either be called something like "R Metadata", or it should start by discussing the key-value metadata more generally.
r/R/schema.R
Outdated
| #' them when pulled back into R. This metadata is separate from the schema | ||
| #' (e.g. types of the columns) which is compatible with other Arrow clients. | ||
| #' The R metadata is only read by R and is ignored by other clients (e.g. | ||
| #' pyarrow which has its own custom metadata for things like Pandas metadata). |
I believe it is Pandas only that stores extra metadata, not pyarrow itself.
r/R/schema.R
Outdated
| #' object can be reconstructed faithfully in R (e.g. with `as.data.frame()`). | ||
| #' This metadata can be both at the top-level of the `data.frame` (e.g. | ||
| #' `attributes(df)`) or at the column (e.g. `attributes(df$col_a)`) or element | ||
| #' level (e.g. `attributes(df[1, "col_a"])`). For example, this allows for |
According to the code, this is only true for list columns (which makes sense, because regular vectors can't have attributes on individual elements).
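To illustrate the point about element-level attributes, here is a minimal base R sketch. It does not use arrow at all; the column name `col_a` comes from the docs under review, while the class name `special_ints` and the `note` attribute are made up for the example.

```r
# Atomic vector: attributes belong to the vector as a whole.
x <- c(1, 2, 3)
attr(x, "units") <- "cm"
elem <- x[[1]]
attributes(elem)  # extracting an element drops the attribute

# List column: every element is its own R object and can carry attributes.
df <- data.frame(id = 1:2)
df$col_a <- list(
  structure(1:3, class = "special_ints"),   # hypothetical class
  structure(letters[1:2], note = "second")  # hypothetical attribute
)
attributes(df$col_a[[1]])  # the class attribute is still there
attributes(df$col_a[[2]])  # so is the note attribute
```

This is why element-level metadata only makes sense for list columns: an element of an atomic vector has nowhere to keep its own attributes.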
r/R/schema.R
Outdated
| #' level (e.g. `attributes(df[1, "col_a"])`). For example, this allows for | ||
| #' storing `haven` columns in a table and being able to faithfully re-create | ||
| #' them when pulled back into R. This metadata is separate from the schema | ||
| #' (e.g. types of the columns) which is compatible with other Arrow clients. |
Suggested change:
- #' (e.g. types of the columns) which is compatible with other Arrow clients.
+ #' (column names and types), which is compatible with other Arrow clients.
r/R/schema.R
Outdated
| #' (e.g. types of the columns) which is compatible with other Arrow clients. | ||
| #' The R metadata is only read by R and is ignored by other clients (e.g. | ||
| #' pyarrow which has its own custom metadata for things like Pandas metadata). | ||
| #' This metadata is stored (and can be accessed with) `table$metadata$r`. |
Shouldn't say table here, we're in the Schema docs.
Suggested change:
- #' This metadata is stored (and can be accessed with) `table$metadata$r`.
+ #' This metadata is stored in `$metadata$r`.
r/R/schema.R
Outdated
| #' include large amounts of metadata) you can set the option | ||
| #' `arrow.compress_metadata` to `FALSE`. | ||
| #' | ||
| #' One exception to storing all metadata: `readr`'s `problems` attribute if it |
I don't think this paragraph is necessary.
r/R/schema.R
Outdated
| #' pyarrow which has its own custom metadata for things like Pandas metadata). | ||
| #' This metadata is stored (and can be accessed with) `table$metadata$r`. | ||
| #' | ||
| #' This metadata is saved by serializing R's attribute list structure to a |
"Since Schema metadata keys and values must be strings, ..."
r/R/schema.R
Outdated
| #' serialized string. Because of this, large amounts of metadata can quickly | ||
| #' increase the size of tables (and therefore the size of tables written to | ||
| #' parquet or feather files). If the (serialized) metadata exceeds 100Kbs in | ||
| #' size, it is first compressed before saving. To disable this compression | ||
| #' (e.g. for tables that are compatible with Arrow versions before 3.0.0 and | ||
| #' include large amounts of metadata) you can set the option | ||
| #' `arrow.compress_metadata` to `FALSE`. |
Suggested change:
- #' serialized string. Because of this, large amounts of metadata can quickly
- #' increase the size of tables (and therefore the size of tables written to
- #' parquet or feather files). If the (serialized) metadata exceeds 100Kbs in
- #' size, it is first compressed before saving. To disable this compression
- #' (e.g. for tables that are compatible with Arrow versions before 3.0.0 and
- #' include large amounts of metadata) you can set the option
- #' `arrow.compress_metadata` to `FALSE`.
+ #' string. If the serialized metadata exceeds 100Kbs in size, by default
+ #' it is compressed starting in version 3.0.0. To disable this compression
+ #' (e.g. for tables that are compatible with Arrow versions before 3.0.0 and
+ #' include large amounts of metadata), set the option
+ #' `arrow.compress_metadata` to `FALSE`. Files with compressed metadata
+ #' are readable by older versions of arrow, but the metadata is dropped.
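For readers following the thread, the behaviour described in these docs can be sketched in base R. This is an illustration under stated assumptions, not the code in this PR: the helper names `serialize_metadata()` and `deserialize_metadata()` are hypothetical, the 100000-byte threshold is a reading of "100Kbs", and `memCompress()` stands in for whatever codec arrow actually uses. Only the option name `arrow.compress_metadata` is taken from the docs.

```r
# Hypothetical helper: turn an R attribute list into a single string,
# compressing first when the string is large and compression is enabled.
serialize_metadata <- function(x, threshold = 100000) {
  # ascii = TRUE gives text-only bytes, so the result is a valid string
  out <- rawToChar(serialize(x, connection = NULL, ascii = TRUE))
  compress <- isTRUE(getOption("arrow.compress_metadata", TRUE))
  if (compress && nchar(out, type = "bytes") > threshold) {
    # Compressed bytes are binary, so wrap them in another ascii
    # serialization to get back to a plain string
    compressed <- memCompress(charToRaw(out), type = "gzip")
    out <- rawToChar(serialize(compressed, connection = NULL, ascii = TRUE))
  }
  out
}

# Hypothetical helper: reverse of the above. A raw payload signals that
# the metadata was compressed (assumes the metadata itself is a list).
deserialize_metadata <- function(s) {
  x <- unserialize(charToRaw(s))
  if (is.raw(x)) {
    x <- unserialize(memDecompress(x, type = "gzip"))
  }
  x
}

meta <- list(attributes = list(class = "data.frame"))
identical(deserialize_metadata(serialize_metadata(meta)), meta)
```

A reader that does not know about the extra compression layer would see a raw vector instead of an attribute list, which is consistent with the note that older versions can read the file but drop the metadata.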
|
Do we have any backwards compat testing with this feature? |
|
No, but I'll make a Jira + work on adding one/some |
|
Also this deserves a NEWS bullet, including a special mention of |
TBH I think we need something in this PR since we're up against the release deadline. Don't need the full spectrum of feather/parquet/compression, just pick one, and make sure that we can read a data.frame, likely with a warning about invalid metadata. |
|
Oops, yeah I just realized I mis-read the notifications on this — I thought this had been merged already. I'll put them here and we can close the (now extraneous) ARROW-11241 |
|
@github-actions crossbow submit test-r-version-compatibility |
|
Revision: a66818d Submitted crossbow builds: ursa-labs/crossbow @ actions-881
|
|
@github-actions crossbow submit test-r-version-compatibility |
|
Revision: 306751f Submitted crossbow builds: ursa-labs/crossbow @ actions-883
|
|
@github-actions crossbow submit test-r-version-compatibility |
|
Revision: fa0041b Submitted crossbow builds: ursa-labs/crossbow @ actions-884
|
No description provided.