[R] Writing a data.frame to Parquet loses its class information #44524

@galachad

Describe the bug, including details regarding any error messages, version, and platform.

When we write a Parquet file with arrow, the information that the object is a plain data.frame is lost.

arrow::write_parquet(iris, "iris.parquet")

When we read iris.parquet back, it is returned as a tibble (tbl_df) by default.

arrow::read_parquet("iris.parquet")
class(arrow::read_parquet("iris.parquet"))
# [1] "tbl_df"     "tbl"        "data.frame"
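
The reproduction above can also be written as a self-contained check that compares the class before and after the round trip. This is only a sketch: the temporary path and the variable name roundtrip are illustrative, and the result of the final comparison depends on the installed arrow version.

```r
library(arrow)

# Write to a temporary file so the example leaves nothing behind
path <- tempfile(fileext = ".parquet")
write_parquet(iris, path)

roundtrip <- read_parquet(path)

class(iris)       # class before writing
class(roundtrip)  # class after reading; the report above shows a tibble here

# TRUE only if the class vector survived the round trip unchanged
identical(class(iris), class(roundtrip))

unlink(path)
```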

This bug was introduced in #34775.

The data.frame class is removed in the .serialize_arrow_r_metadata function:
https://github.com/apache/arrow/blame/7ef5437e23bd7d7571a0c7a7fc0c5d3634816802/r/R/metadata.R#L25

prop <- arrow::ParquetFileReader$create(
  "iris.parquet",
  props = arrow::ParquetArrowReaderProperties$create()
)
prop$GetSchema()$metadata$r$attributes$class
# NULL

The class attribute is dropped when the Parquet file is saved.

Workaround:

  1. Add an extra class to the data.frame
new_iris <- iris
class(new_iris) <- c("custom.data.frame", "data.frame")
arrow::write_parquet(new_iris, "iris.parquet")
prop <- arrow::ParquetFileReader$create(
  "iris.parquet",
  props = arrow::ParquetArrowReaderProperties$create()
)
prop$GetSchema()$metadata$r$attributes$class
# [1] "custom.data.frame" "data.frame"    
  2. Apply the metadata properties manually
# Convert the data.frame to an Arrow Table
iris_table <- arrow::Table$create(iris)

# Retrieve existing metadata (if any)
existing_metadata <- iris_table$schema$metadata

# Add or update the 'class' metadata to indicate it's a data.frame
new_metadata <- existing_metadata
new_metadata$r$attributes$class <- c("data.frame", "custom.data.frame")

# Update the schema with the new metadata
iris_table <- iris_table$ReplaceSchemaMetadata(new_metadata)

arrow::write_parquet(iris_table, "iris.parquet")
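
A read-side alternative, which is not one of the workarounds above, is to convert the result back with base R's as.data.frame(). This is a minimal sketch that assumes it is acceptable to discard every class placed ahead of "data.frame", including any custom ones; the temporary path and the name iris_df are illustrative.

```r
library(arrow)

path <- tempfile(fileext = ".parquet")
write_parquet(iris, path)

# read_parquet() returns a tibble by default; as.data.frame() strips
# the "tbl_df" and "tbl" classes and keeps only "data.frame"
iris_df <- as.data.frame(read_parquet(path))
class(iris_df)

unlink(path)
```

This does not restore the lost metadata in the file itself, so every reader has to apply the conversion.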

Bug description:

Because the data.frame class attribute is removed on write, the Parquet file is read back as a tibble by default. In my opinion this is not the expected behavior: when we write a data.frame, we should read back a data.frame.

What was the reason for removing the class when it is just data.frame?

Component(s)

R
