Closed
Description
Describe the bug, including details regarding any error messages, version, and platform.
When a data.frame is written to Parquet with arrow, the information that the object is a plain data.frame is lost.
arrow::write_parquet(iris, "iris.parquet")
When iris.parquet is read back, it is returned as a tibble (tbl_df) by default.
arrow::read_parquet("iris.parquet")
class(arrow::read_parquet("iris.parquet"))
# [1] "tbl_df" "tbl" "data.frame"
This bug was introduced in #34775
The data.frame class is removed in the .serialize_arrow_r_metadata function.
https://github.com/apache/arrow/blame/7ef5437e23bd7d7571a0c7a7fc0c5d3634816802/r/R/metadata.R#L25
prop <- arrow::ParquetFileReader$create(
"iris.parquet",
props = arrow::ParquetArrowReaderProperties$create()
)
prop$GetSchema()$metadata$r$attributes$class
# NULL
When the Parquet file was saved, the class attribute was removed.
Workarounds:
- Add an extra class to the data.frame
new_iris <- iris
class(new_iris) <- c("custom.data.frame", "data.frame")
arrow::write_parquet(new_iris, "iris.parquet")
prop <- arrow::ParquetFileReader$create(
"iris.parquet",
props = arrow::ParquetArrowReaderProperties$create()
)
prop$GetSchema()$metadata$r$attributes$class
# [1] "custom.data.frame" "data.frame"
- Set the metadata properties manually
# Convert the data.frame to an Arrow Table
iris_table <- arrow::Table$create(iris)
# Retrieve existing metadata (if any)
existing_metadata <- iris_table$schema$metadata
# Add or update the 'class' metadata to indicate it's a data.frame
new_metadata <- existing_metadata
new_metadata$r$attributes$class <- c("data.frame", "custom.data.frame")
# Update the schema with the new metadata
iris_table <- iris_table$ReplaceSchemaMetadata(new_metadata)
arrow::write_parquet(iris_table, "iris.parquet")
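A read-side alternative, which is not one of the workarounds listed above and is only a minimal sketch: coerce the result of read_parquet back to a plain data.frame with base R's as.data.frame(). It assumes the arrow package is installed and uses a temporary file (the `path` variable is purely illustrative) instead of iris.parquet. It does not restore the class metadata in the file; it only normalizes the class after reading.

```r
library(arrow)

# Write iris to a temporary Parquet file (illustrative path, not from the report)
path <- tempfile(fileext = ".parquet")
arrow::write_parquet(iris, path)

# read_parquet() returns a tibble by default, as described in this issue
iris_tbl <- arrow::read_parquet(path)

# Coerce back to a plain data.frame on the read side
iris_df <- as.data.frame(iris_tbl)
class(iris_df)
```

This has to be repeated at every read site, so it does not replace a fix in the metadata serialization itself.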
Bug description:
When the data.frame class attribute is removed, the Parquet file is read as a tibble by default. In my opinion this is not the expected behavior: if we write a data.frame, we should read back a data.frame.
What was the reason for removing the class when it is just data.frame?
Component(s)
R