[C++] Parquet reader is unable to read LargeString columns #39682

@nicki-dese

Description

Describe the bug, including details regarding any error messages, version, and platform.

read_parquet() is giving the following error with large parquet files:

Capacity error: array cannot contain more than 2147483646 bytes, have 2147489180

Versions etc. from sessionInfo():

  • arrow 14.0.0.2
  • R version 4.3.0 (2023-04-21 ucrt)
  • Platform: x86_64-w64-mingw32/x64
  • Windows 11 x64 (build 22621)

Descriptive info on an example problematic table, with two columns:

  • 140 million rows.
  • id: large_string, 4.2 GB
  • state: int32, 0.5 GB

The id is a hashed string, 24 characters long. It is not practical to change it, as it's the joining key.

Note: the data above is stored as a data.table in R and was left that way when saving with write_parquet(), but I've converted it to an Arrow Table for the descriptive stats above, because I thought they'd be more useful to you!
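
For context, 2147483646 is 2^31 - 2, presumably the limit of a regular (non-large) string array, whose offsets are 32-bit. Below is a minimal sketch that should reproduce the shape of the problem (untested; the file name is illustrative, and building the table needs several GB of RAM):

    library(arrow)

    # 140 million 24-character strings is ~3.4 GB of string data, well past
    # the ~2 GiB a regular string array can hold, so the column ends up as
    # large_string (per the schema reported above).
    n  <- 140e6
    df <- data.frame(
      id    = rep(strrep("a", 24), n),  # stand-in for the hashed join keys
      state = rep(1L, n)
    )

    write_parquet(df, "big.parquet")
    read_parquet("big.parquet")
    #> Error: Capacity error: array cannot contain more than 2147483646 bytes, ...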

Other relevant information:

  • The large parquet files were created with arrow::write_parquet()
  • The same files previously opened fine with an earlier version of read_parquet()
    (unfortunately I'm not sure which version, but it was working late November/early December; we work in a closed environment, use Posit Package Manager, and our VMs rebuild every 30 days, so it would have been a fairly recent version)
  • I've duplicated the error, and it still occurs with newly created large parquet files, such as the one described above
  • Loading the same files with open_dataset() works (see the sketch below). However, our team uses targets, which implicitly calls read_parquet(), so this bug has unfortunately affected many of our workflows.
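
A hedged sketch of the interim workaround, in case it helps others ("big.parquet" is a stand-in for the real file path). My understanding is that open_dataset() reads the column in batches, as a chunked array rather than one contiguous array, which seems to be why it avoids the capacity error:

    library(arrow)
    library(dplyr)

    # Read via the Dataset API instead of read_parquet(); collect() brings
    # the result into R as a data frame.
    df <- open_dataset("big.parquet") |>
      collect()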

Note: I haven't been able to roll back to an earlier version of arrow. Because we only have earlier source versions, not binaries, and I'm on Windows, I get libarrow errors when installing. If there is a workaround for this, please let me know.

UPDATE

With the help of IT, I have been able to install earlier versions of arrow in my environment, and have shown that:

  • The bug was introduced in arrow version 14.0.0, and it's the read_parquet() function that is at issue, not write_parquet().
  • With version 13.0.0.1 I was able to open the problematic file described above, even though it was created with write_parquet() from version 14.0.0.2.
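
For anyone else needing to pin the last known-good release, a hedged sketch (on Windows this builds from source unless a binary for that exact version is available, e.g. from a dated Posit Package Manager snapshot, which is where the libarrow errors came in for me):

    # install.packages("remotes")
    # Pin arrow 13.0.0.1, the last version whose read_parquet() handled
    # this file for us.
    remotes::install_version("arrow", version = "13.0.0.1")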

Component(s)

Parquet, R
