ARROW-10386: [R] List column class attributes not preserved in roundtrip #9182

jonkeane · 2021-01-13T00:09:24Z

No description provided.

github-actions · 2021-01-13T00:09:50Z

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

jonkeane · 2021-01-13T00:11:56Z

@github-actions crossbow submit test-r-version-compatibility

github-actions · 2021-01-13T00:15:25Z

https://issues.apache.org/jira/browse/ARROW-10386

jonkeane · 2021-01-13T00:54:36Z

@github-actions crossbow submit test-r-version-compatibility

github-actions · 2021-01-13T00:55:25Z

Revision: d18222dca3587269aa2abc27ee9033cd352351f4

Submitted crossbow builds: ursa-labs/crossbow @ actions-878

Task	Status
test-r-version-compatibility

nealrichardson · 2021-01-13T18:28:19Z

r/R/schema.R

+#'
+#' @section Metadata:
+#'
+#'   Attributes from the `data.frame` are saved alongside tables so that the


"When converting a data.frame to an Arrow Table or RecordBatch, "

nealrichardson · 2021-01-13T18:29:05Z

r/R/schema.R

 #'    Modify or replace by assigning in (`sch$metadata <- new_metadata`).
 #'    All list elements are coerced to string.
+#'
+#' @section Metadata:


This section should either be called something like "R Metadata", or it should start by discussing the key-value metadata more generally.

nealrichardson · 2021-01-13T18:29:45Z

r/R/schema.R

+#'   them when pulled back into R. This metadata is separate from the schema
+#'   (e.g. types of the columns) which is compatible with other Arrow clients.
+#'   The R metadata is only read by R and is ignored by other clients (e.g.
+#'   pyarrow which has its own custom metadata for things like Pandas metadata).


I believe it is Pandas only that stores extra metadata, not pyarrow itself.

nealrichardson · 2021-01-13T18:31:02Z

r/R/schema.R

+#'   object can be reconstructed faithfully in R (e.g. with `as.data.frame()`).
+#'   This metadata can be both at the top-level of the `data.frame` (e.g.
+#'   `attributes(df)`) or at the column (e.g. `attributes(df$col_a)`) or element
+#'   level (e.g. `attributes(df[1, "col_a"])`). For example, this allows for


According to the code, this is only true for list columns, which makes sense because regular (atomic) vectors can't have attributes on individual elements.
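To illustrate the point about element-level attributes, here is a minimal base R sketch. It does not use arrow at all; the column name `col_a` is borrowed from the docs under review, and the class name `special_thing` and the `note` attribute are made up for the example.

```r
# Atomic vector: attributes belong to the vector as a whole.
x <- c(1, 2, 3)
attr(x, "units") <- "cm"
elem_attrs <- attributes(x[1])  # subsetting drops "units", so nothing is left

# List column: each element is its own R object and can carry attributes.
df <- data.frame(id = 1:2)
df$col_a <- list(
  structure(list(1, 2), class = "special_thing"),  # hypothetical class
  structure(list("a"), note = "second element")    # hypothetical attribute
)
first_attrs <- attributes(df$col_a[[1]])   # contains the class
second_attrs <- attributes(df$col_a[[2]])  # contains the note
```

Note that the elements are extracted with `[[`, since `[` on a list column returns a one-element list rather than the element itself.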

nealrichardson · 2021-01-13T18:31:35Z

r/R/schema.R

+#'   level (e.g. `attributes(df[1, "col_a"])`). For example, this allows for
+#'   storing `haven` columns in a table and being able to faithfully re-create
+#'   them when pulled back into R. This metadata is separate from the schema
+#'   (e.g. types of the columns) which is compatible with other Arrow clients.


Suggested change

-#' (e.g. types of the columns) which is compatible with other Arrow clients.
+#' (column names and types), which is compatible with other Arrow clients.

nealrichardson · 2021-01-13T18:32:10Z

r/R/schema.R

+#'   (e.g. types of the columns) which is compatible with other Arrow clients.
+#'   The R metadata is only read by R and is ignored by other clients (e.g.
+#'   pyarrow which has its own custom metadata for things like Pandas metadata).
+#'   This metadata is stored (and can be accessed with) `table$metadata$r`.


This shouldn't say `table` here; we're in the Schema docs.

Suggested change

-#' This metadata is stored (and can be accessed with) `table$metadata$r`.
+#' This metadata is stored in `$metadata$r`.

nealrichardson · 2021-01-13T18:32:41Z

r/R/schema.R

+#'   include large amounts of metadata) you can set the option
+#'   `arrow.compress_metadata` to `FALSE`.
+#'
+#'   One exception to storing all metadata: `readr`'s `problems` attribute if it


I don't think this paragraph is necessary.

nealrichardson · 2021-01-13T18:33:34Z

r/R/schema.R

+#'   pyarrow which has its own custom metadata for things like Pandas metadata).
+#'   This metadata is stored (and can be accessed with) `table$metadata$r`.
+#'
+#'   This metadata is saved by serializing R's attribute list structure to a


"Since Schema metadata keys and values must be strings, ..."

nealrichardson · 2021-01-13T18:40:06Z

r/R/schema.R

+#'   serialized string. Because of this, large amounts of metadata can quickly
+#'   increase the size of tables (and therefore the size of tables written to
+#'   parquet or feather files). If the (serialized) metadata exceeds 100Kbs in
+#'   size, it is first compressed before saving. To disable this compression
+#'   (e.g. for tables that are compatible with Arrow versions before 3.0.0 and
+#'   include large amounts of metadata) you can set the option
+#'   `arrow.compress_metadata` to `FALSE`.


Suggested change

-#' serialized string. Because of this, large amounts of metadata can quickly
-#' increase the size of tables (and therefore the size of tables written to
-#' parquet or feather files). If the (serialized) metadata exceeds 100Kbs in
-#' size, it is first compressed before saving. To disable this compression
-#' (e.g. for tables that are compatible with Arrow versions before 3.0.0 and
-#' include large amounts of metadata) you can set the option
-#' `arrow.compress_metadata` to `FALSE`.
+#' string. If the serialized metadata exceeds 100Kbs in size, by default
+#' it is compressed starting in version 3.0.0. To disable this compression
+#' (e.g. for tables that are compatible with Arrow versions before 3.0.0 and
+#' include large amounts of metadata), set the option
+#' `arrow.compress_metadata` to `FALSE`. Files with compressed metadata
+#' are readable by older versions of arrow, but the metadata is dropped.
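For readers following the serialize-then-compress behavior described in this docs hunk, a rough base R sketch of the idea is below. The helper names `serialize_attrs()` and `unserialize_attrs()`, the `threshold` argument, and the way the `arrow.compress_metadata` option is read are all assumptions made for illustration; this is not presented as the arrow package's actual implementation.

```r
# Illustrative helpers only; the names and structure are assumptions,
# not the arrow package's internal functions.
serialize_attrs <- function(x, threshold = 100 * 1024) {
  # ASCII serialization gives a string that can be stored as a metadata value.
  out <- rawToChar(serialize(x, connection = NULL, ascii = TRUE))
  compress <- isTRUE(getOption("arrow.compress_metadata", TRUE))
  if (compress && nchar(out, type = "bytes") > threshold) {
    # gzip the bytes, then serialize the raw vector so the result is a string again.
    zipped <- memCompress(charToRaw(out), type = "gzip")
    out <- rawToChar(serialize(zipped, connection = NULL, ascii = TRUE))
  }
  out
}

unserialize_attrs <- function(s) {
  obj <- unserialize(charToRaw(s))
  # A raw vector here means the payload was compressed by serialize_attrs().
  if (is.raw(obj)) {
    obj <- unserialize(memDecompress(obj, type = "gzip"))
  }
  obj
}

# Round trip of a small, made-up attribute list.
attrs <- list(attributes = list(class = "special_df"), columns = list(col_a = NULL))
restored <- unserialize_attrs(serialize_attrs(attrs))
```

In this sketch, a reader that does not know about the compressed form would unserialize a raw vector instead of an attribute list, which is one way to picture why older versions would need to drop such metadata rather than apply it.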

nealrichardson · 2021-01-13T20:04:49Z

Do we have any backwards compat testing with this feature?

jonkeane · 2021-01-13T20:06:54Z

No, but I'll make a Jira + work on adding one/some

nealrichardson · 2021-01-13T20:21:00Z

Also this deserves a NEWS bullet, including a special mention of sf data.

nealrichardson · 2021-01-13T20:58:02Z

> No, but I'll make a Jira + work on adding one/some

TBH I think we need something in this PR since we're up against the release deadline. Don't need the full spectrum of feather/parquet/compression, just pick one, and make sure that we can read a data.frame, likely with a warning about invalid metadata.

jonkeane · 2021-01-13T20:59:33Z

Oops, yeah I just realized I mis-read the notifications on this — I thought this had been merged already. I'll put them here and we can close the (now extraneous) ARROW-11241

jonkeane · 2021-01-13T21:12:56Z

@github-actions crossbow submit test-r-version-compatibility

github-actions · 2021-01-13T21:18:45Z

Revision: a66818d

Submitted crossbow builds: ursa-labs/crossbow @ actions-881

Task	Status
test-r-version-compatibility

jonkeane · 2021-01-13T21:41:58Z

@github-actions crossbow submit test-r-version-compatibility

github-actions · 2021-01-13T21:46:26Z

Revision: 306751f

Submitted crossbow builds: ursa-labs/crossbow @ actions-883

Task	Status
test-r-version-compatibility

jonkeane · 2021-01-13T22:07:16Z

@github-actions crossbow submit test-r-version-compatibility

github-actions · 2021-01-13T22:09:16Z

Revision: fa0041b

Submitted crossbow builds: ursa-labs/crossbow @ actions-884

Task	Status
test-r-version-compatibility

github-actions bot added the Component: R label Jan 13, 2021

jonkeane changed the title ~~[WIP] ARROW-10386: [R] List column class attributes not preserved in roundtrip~~ ARROW-10386: [R] List column class attributes not preserved in roundtrip [WIP] Jan 13, 2021

jonkeane force-pushed the ARROW-10386/List_metadata branch from 039b59e to d18222d Compare January 13, 2021 00:53

romainfrancois and others added 7 commits January 13, 2021 11:43

store metadata for each element of a list column too, not just the list itself. ARROW-10386.

05bf857

update test

0c6065a

Slight clarification on test

57f05e2

Try some compression

a92ed0d

Oops, attributes must be lists.

95aaa30

Add option for disabling compression

6fd2d35

Updated documentation

5649500

jonkeane force-pushed the ARROW-10386/List_metadata branch from 68d6c47 to 5649500 Compare January 13, 2021 17:46

jonkeane changed the title ~~ARROW-10386: [R] List column class attributes not preserved in roundtrip [WIP]~~ ARROW-10386: [R] List column class attributes not preserved in roundtrip Jan 13, 2021

CI bump

92fa1f3

nealrichardson requested changes Jan 13, 2021

github-actions bot added the Component: Parquet label Jan 13, 2021

PR comments

82679fa

nealrichardson mentioned this pull request Jan 13, 2021

ARROW-10386 [R]: List column class attributes not preserved in roundtrip #8549

Closed

nealrichardson approved these changes Jan 13, 2021

jonkeane added 2 commits January 13, 2021 15:01

📰

920cfb1

add extra-tests for compressed metadata

a66818d

expect warning for compressed metadata prior to 3.0.0

306751f

backwards compatibility + fixed = TRUE

fa0041b

nealrichardson closed this in 6deb892 Jan 13, 2021

jonkeane deleted the ARROW-10386/List_metadata branch May 5, 2021 12:53

This was referenced Apr 26, 2021

[R] List column class attributes not preserved in roundtrip #26370

Closed

[R] Disable row-level metadata application on datasets #28879

Closed

