@jonkeane
Member

No description provided.

@github-actions

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@jonkeane
Member Author

@github-actions crossbow submit test-r-version-compatibility

@jonkeane changed the title from "[WIP] ARROW-10386: [R] List column class attributes not preserved in roundtrip" to "ARROW-10386: [R] List column class attributes not preserved in roundtrip [WIP]" on Jan 13, 2021
@jonkeane force-pushed the ARROW-10386/List_metadata branch from 039b59e to d18222d on January 13, 2021 00:53
@jonkeane
Member Author

@github-actions crossbow submit test-r-version-compatibility

@github-actions

Revision: d18222dca3587269aa2abc27ee9033cd352351f4

Submitted crossbow builds: ursa-labs/crossbow @ actions-878

Task Status
test-r-version-compatibility Github Actions

@jonkeane force-pushed the ARROW-10386/List_metadata branch from 68d6c47 to 5649500 on January 13, 2021 17:46
@jonkeane changed the title from "ARROW-10386: [R] List column class attributes not preserved in roundtrip [WIP]" to "ARROW-10386: [R] List column class attributes not preserved in roundtrip" on Jan 13, 2021
r/R/schema.R Outdated
#'
#' @section Metadata:
#'
#' Attributes from the `data.frame` are saved alongside tables so that the
Member

"When converting a data.frame to an Arrow Table or RecordBatch, "

r/R/schema.R Outdated
#' Modify or replace by assigning in (`sch$metadata <- new_metadata`).
#' All list elements are coerced to string.
#'
#' @section Metadata:
Member

This section should either be called something like "R Metadata", or it should start by discussing the key-value metadata more generally.

r/R/schema.R Outdated
#' them when pulled back into R. This metadata is separate from the schema
#' (e.g. types of the columns) which is compatible with other Arrow clients.
#' The R metadata is only read by R and is ignored by other clients (e.g.
#' pyarrow which has its own custom metadata for things like Pandas metadata).
Member

I believe it is Pandas only that stores extra metadata, not pyarrow itself.

r/R/schema.R Outdated
#' object can be reconstructed faithfully in R (e.g. with `as.data.frame()`).
#' This metadata can be both at the top-level of the `data.frame` (e.g.
#' `attributes(df)`) or at the column (e.g. `attributes(df$col_a)`) or element
#' level (e.g. `attributes(df[1, "col_a"])`). For example, this allows for
Member

According to the code, this is only true for list columns (which makes sense because regular vectors can't have attributes on elements)
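
A minimal base-R sketch of why element-level attributes only apply to list columns. This is an illustration, not code from the PR; the data frame `df`, the column `col_a`, and the `units` attribute are hypothetical.

```r
# Atomic vector: an attribute set on a single element cannot be stored
# back into the vector, so it is dropped on assignment.
x <- c(1.5, 2.5)
el <- x[[1]]
attr(el, "units") <- "cm"
x[[1]] <- el
atomic_attrs <- attributes(x)   # NULL

# List column: each element is its own R object and keeps its attributes.
df <- data.frame(id = 1:2)
df$col_a <- list(structure(1.5, units = "cm"), 2.5)
element_attrs <- attributes(df$col_a[[1]])
```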

r/R/schema.R Outdated
#' level (e.g. `attributes(df[1, "col_a"])`). For example, this allows for
#' storing `haven` columns in a table and being able to faithfully re-create
#' them when pulled back into R. This metadata is separate from the schema
#' (e.g. types of the columns) which is compatible with other Arrow clients.
Member

Suggested change
- #' (e.g. types of the columns) which is compatible with other Arrow clients.
+ #' (column names and types), which is compatible with other Arrow clients.

r/R/schema.R Outdated
#' (e.g. types of the columns) which is compatible with other Arrow clients.
#' The R metadata is only read by R and is ignored by other clients (e.g.
#' pyarrow which has its own custom metadata for things like Pandas metadata).
#' This metadata is stored (and can be accessed with) `table$metadata$r`.
Member

Shouldn't say table here, we're in the Schema docs.

Suggested change
- #' This metadata is stored (and can be accessed with) `table$metadata$r`.
+ #' This metadata is stored in `$metadata$r`.

r/R/schema.R Outdated
#' include large amounts of metadata) you can set the option
#' `arrow.compress_metadata` to `FALSE`.
#'
#' One exception to storing all metadata: `readr`'s `problems` attribute if it
Member

I don't think this paragraph is necessary.

r/R/schema.R Outdated
#' pyarrow which has its own custom metadata for things like Pandas metadata).
#' This metadata is stored (and can be accessed with) `table$metadata$r`.
#'
#' This metadata is saved by serializing R's attribute list structure to a
Member

"Since Schema metadata keys and values must be strings, ..."

r/R/schema.R Outdated
Comment on lines 69 to 75
#' serialized string. Because of this, large amounts of metadata can quickly
#' increase the size of tables (and therefore the size of tables written to
#' parquet or feather files). If the (serialized) metadata exceeds 100Kbs in
#' size, it is first compressed before saving. To disable this compression
#' (e.g. for tables that are compatible with Arrow versions before 3.0.0 and
#' include large amounts of metadata) you can set the option
#' `arrow.compress_metadata` to `FALSE`.
Member

Suggested change
- #' serialized string. Because of this, large amounts of metadata can quickly
- #' increase the size of tables (and therefore the size of tables written to
- #' parquet or feather files). If the (serialized) metadata exceeds 100Kbs in
- #' size, it is first compressed before saving. To disable this compression
- #' (e.g. for tables that are compatible with Arrow versions before 3.0.0 and
- #' include large amounts of metadata) you can set the option
- #' `arrow.compress_metadata` to `FALSE`.
+ #' string. If the serialized metadata exceeds 100Kbs in size, by default
+ #' it is compressed starting in version 3.0.0. To disable this compression
+ #' (e.g. for tables that are compatible with Arrow versions before 3.0.0 and
+ #' include large amounts of metadata), set the option
+ #' `arrow.compress_metadata` to `FALSE`. Files with compressed metadata
+ #' are readable by older versions of arrow, but the metadata is dropped.
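
For readers following the thread, a rough base-R sketch of the idea described in these docs: an attribute list is serialized to a string, optionally compressed first, and can be restored later. This is an assumption-laden illustration, not the arrow package's implementation; the `attrs` list and its contents are hypothetical.

```r
# Hypothetical attribute list standing in for a data.frame's R metadata.
attrs <- list(
  attributes = list(class = c("tbl_df", "tbl", "data.frame")),
  columns = list(col_a = list(attributes = list(label = "Column A")))
)

# Schema metadata values must be strings, so serialize to ASCII text.
as_string <- rawToChar(serialize(attrs, NULL, ascii = TRUE))
restored <- unserialize(charToRaw(as_string))

# Compressed variant: gzip the binary serialization, then wrap the
# resulting raw vector in an ASCII serialization so it is still a string.
compressed <- rawToChar(
  serialize(memCompress(serialize(attrs, NULL), type = "gzip"), NULL, ascii = TRUE)
)
restored_compressed <- unserialize(
  memDecompress(unserialize(charToRaw(compressed)), type = "gzip")
)
```

A reader that does not know about the compressed form would see a raw vector where it expects an attribute list, which is consistent with the note above that older versions drop the metadata.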

@nealrichardson
Member

Do we have any backwards compat testing with this feature?

@jonkeane
Member Author

No, but I'll make a Jira + work on adding one/some

@nealrichardson
Member

Also this deserves a NEWS bullet, including a special mention of sf data.

@nealrichardson
Member

> No, but I'll make a Jira + work on adding one/some

TBH I think we need something in this PR since we're up against the release deadline. Don't need the full spectrum of feather/parquet/compression, just pick one, and make sure that we can read a data.frame, likely with a warning about invalid metadata.

@jonkeane
Member Author

Oops, yeah I just realized I mis-read the notifications on this — I thought this had been merged already. I'll put them here and we can close the (now extraneous) ARROW-11241

@jonkeane
Member Author

@github-actions crossbow submit test-r-version-compatibility

@github-actions

Revision: a66818d

Submitted crossbow builds: ursa-labs/crossbow @ actions-881

Task Status
test-r-version-compatibility Github Actions

@jonkeane
Member Author

@github-actions crossbow submit test-r-version-compatibility

@github-actions

Revision: 306751f

Submitted crossbow builds: ursa-labs/crossbow @ actions-883

Task Status
test-r-version-compatibility Github Actions

@jonkeane
Member Author

@github-actions crossbow submit test-r-version-compatibility

@github-actions

Revision: fa0041b

Submitted crossbow builds: ursa-labs/crossbow @ actions-884

Task Status
test-r-version-compatibility Github Actions
