Description
Describe the bug, including details regarding any error messages, version, and platform.
read_parquet() is giving the following error with large parquet files:
Capacity error: array cannot contain more than 2147483646 bytes, have 2147489180
Versions etc. from sessionInfo():
- arrow 14.0.0.2
- R version 4.3.0 (2023-04-21 ucrt)
- Platform: x86_64-w64-mingw32/x64
- Windows 11 x64 (build 22621)
Descriptive info on an example problematic table with two columns:
- 140 million rows
- id: large_string, 4.2 GB
- state: int32, 0.5 GB
The id is a hashed string, 24 characters long. It is not practical to change it, as it's the joining key.
Note: the data above is stored as a data.table in R and was left that way when saving it with write_parquet(). I've converted it to an Arrow table for the descriptive stats above, because I thought they'd be more useful to you!
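A hypothetical reproduction sketch (untested at this scale; the id values and the file path are made up, and materialising 140 million strings needs a lot of RAM, though any string column totalling just over 2 GiB should trigger the same limit):

```r
library(arrow)

n <- 140e6  # ~140 million rows, matching the table above

# Two columns: a 24-character string key and a small integer state.
df <- data.frame(
  id    = paste0("hash", formatC(seq_len(n), width = 20, flag = "0")),
  state = sample.int(10L, n, replace = TRUE)
)

write_parquet(df, "big.parquet")  # writing succeeds
read_parquet("big.parquet")
#> Error: Capacity error: array cannot contain more than 2147483646 bytes
```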
Other relevant information:
- The large parquet files were created with arrow::write_parquet()
- The same files previously opened with an earlier version of read_parquet() (unfortunately I'm not sure which version, but it was working in late November/early December; we work in a closed environment, use Posit Package Manager, and our VMs rebuild every 30 days, so it would have been a fairly recent version).
- I've duplicated the error, and it still occurs with newly created large parquet files, such as the one described above.
- Loading the same files with open_dataset() works (see the sketch after this list). However, our team uses targets, which implicitly calls read_parquet(), so this bug has unfortunately affected many of our workflows.
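For anyone else hitting this, a minimal sketch of the open_dataset() workaround, using "big.parquet" as a placeholder path:

```r
library(arrow)
library(dplyr)

# open_dataset() scans the file lazily rather than reading each column
# into a single Arrow array, which appears to sidestep the 2 GiB
# per-array limit that read_parquet() runs into:
df <- open_dataset("big.parquet") |> collect()
```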
Note: I haven't been able to roll back to an earlier version of arrow: we only have earlier source versions and not binaries, and since I'm on Windows I get libarrow errors when building from source. If there is a workaround for this, please let me know.
UPDATE
With the help of IT, I have been able to install earlier versions of arrow in my environment (one way to pin a version is sketched after this list), and have shown that:
- The bug was introduced in arrow version 14.0.0, and it's the read_parquet() function that is the issue, not write_parquet().
- With version 13.0.0.1 I was able to open the problematic file described above, even though it was created with write_parquet() from version 14.0.0.2.
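One possible way to pin an earlier binary is via a dated Posit Package Manager CRAN snapshot (whether this works will depend on your environment; the snapshot date below is illustrative, pick one from when arrow 13.0.0.1 was current):

```r
# Install the Windows binary from a dated snapshot instead of building
# the source package (which fails with libarrow errors on our setup):
install.packages(
  "arrow",
  repos = "https://packagemanager.posit.co/cran/2023-09-15"
)
packageVersion("arrow")  # confirm which version was installed
```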
Component(s)
Parquet, R