[C++] Parquet reader is unable to read LargeString columns #39682

@nicki-dese

Description

Describe the bug, including details regarding any error messages, version, and platform.

read_parquet() is giving the following error with large parquet files:

Capacity error: array cannot contain more than 2147483646 bytes, have 2147489180

Versions etc. from sessionInfo():

  • arrow 14.0.0.2
  • R version 4.3.0 (2023-04-21 ucrt)
  • Platform: x86_64-w64-mingw32/x64
  • Windows 11 x64 (build 22621)

Descriptive info on an example problematic table, with two columns:

  • 140 million rows.
  • id: large_string, 4.2 GB
  • state: int32, 0.5 GB

The id is a hashed string, 24 characters long. It is not practical to change it, as it's the joining key.

Note: the data above is stored as a data.table in R and was left that way when saving with write_parquet(), but I've converted it to an Arrow Table for the descriptive stats above, because I thought they'd be more useful to you!
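
For context, 2147483646 is 2^31 - 2, presumably the limit of a regular (non-large) string array, whose offsets are 32-bit. Below is a minimal sketch that should reproduce the shape of the problem (untested; the file name is illustrative, and building the table needs several GB of RAM):

    library(arrow)

    # 140 million 24-character strings is ~3.4 GB of string data, well past
    # the ~2 GiB a regular string array can hold, so the column ends up as
    # large_string (per the schema reported above).
    n  <- 140e6
    df <- data.frame(
      id    = rep(strrep("a", 24), n),  # stand-in for the hashed join keys
      state = rep(1L, n)
    )

    write_parquet(df, "big.parquet")
    read_parquet("big.parquet")
    #> Error: Capacity error: array cannot contain more than 2147483646 bytes, ...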

Other relevant information:

  • The large parquet files were created with arrow::write_parquet()
  • The same files previously opened fine with an earlier version of read_parquet()
    (unfortunately I'm not sure which version, but it was working late November/early December; we work in a closed environment, use Posit Package Manager, and our VMs rebuild every 30 days, so it would have been a fairly recent version)
  • I've duplicated the error, and it still occurs with newly created large parquet files, such as the one described above
  • Loading the same files with open_dataset() works (see the sketch below). However, our team uses targets, which implicitly calls read_parquet(), so this bug has unfortunately affected many of our workflows.
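
A hedged sketch of the interim workaround, in case it helps others ("big.parquet" is a stand-in for the real file path). My understanding is that open_dataset() reads the column in batches, as a chunked array rather than one contiguous array, which seems to be why it avoids the capacity error:

    library(arrow)
    library(dplyr)

    # Read via the Dataset API instead of read_parquet(); collect() brings
    # the result into R as a data frame.
    df <- open_dataset("big.parquet") |>
      collect()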

Note: I haven't been able to roll back to an earlier version of arrow. Because we only have earlier source versions, not binaries, and I'm on Windows, I get libarrow errors when installing. If there is a workaround for this, please let me know.

UPDATE

With the help of IT, I have been able to install earlier versions of arrow in my environment, and have shown that:

  • The bug was introduced in arrow version 14.0.0, and it's the read_parquet() function that is at issue, not write_parquet().
  • With version 13.0.0.1 I was able to open the problematic file described above, even though it was created with write_parquet() from version 14.0.0.2.
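
For anyone else needing to pin the last known-good release, a hedged sketch (on Windows this builds from source unless a binary for that exact version is available, e.g. from a dated Posit Package Manager snapshot, which is where the libarrow errors came in for me):

    # install.packages("remotes")
    # Pin arrow 13.0.0.1, the last version whose read_parquet() handled
    # this file for us.
    remotes::install_version("arrow", version = "13.0.0.1")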

Component(s)

Parquet, R
