🐞 Describe the Bug
While using pyarrow to handle real-life survey data from our custom database at Crunch.io, we ran into an issue: writing the data to Parquet files works as expected, but reading the data back into a table using pq.read_table raises an error. The error depends on the data size and appears to be related to nested types.
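For context, here is a minimal sketch of the nested column type involved, inferred from the reproduction code below (the exact type is our assumption, not output from the report): a map whose keys are dictionary-encoded strings and whose items are 64-bit integers.

import pyarrow as pa

# Sketch of the nested column type built by the reproduction below
# (inferred from that code; the exact type is an assumption).
map_type = pa.map_(pa.dictionary(pa.int64(), pa.string()), pa.int64())
print(map_type)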
Platform & Version Info
- Library: PyArrow
- Language: Python
- Environment: macOS
- Version: 13.0.0
⚠️ Error Message
Here is the traceback of the encountered error:
Traceback (most recent call last):
File "/Users/slobodanilic/Documents/Crunch/repos/crunch-lake/testmap.py", line 61, in <module>
loaded_map_array = pq.read_table("test.parquet").column(0)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/slobodanilic/Documents/Crunch/repos/crunch-lake/venv/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 3002, in read_table
return dataset.read(columns=columns, use_threads=use_threads,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/slobodanilic/Documents/Crunch/repos/crunch-lake/venv/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 2630, in read
table = self._dataset.to_table(
^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/_dataset.pyx", line 556, in pyarrow._dataset.Dataset.to_table
File "pyarrow/_dataset.pyx", line 3638, in pyarrow._dataset.Scanner.to_table
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Nested data conversions not implemented for chunked array outputs
📄 Reproducible Code
Run the code as-is with 200K rows to reproduce the error. Reduce the row count to 100K, and it should work.
"""
Test writing/reading large map array in chunks.
This example demonstrates an issue when trying to encode real-life survey data results
into a map-array structure in pyarrow and saving it into a parquet file. Reading it back
raises an error: `Nested data conversions not implemented for chunked array outputs`.
"""
from typing import List
import numpy as np
from numpy import ndarray
import pyarrow as pa
import pyarrow.parquet as pq
# Parameters
N_ROWS: int = 200000 # changing this to 100K will make the example work
N_COLS: int = 600
SPARSITY: float = 0.5
CHUNK_SIZE: int = 10000
# Calculate sparsity-affected column size
N_COLS_W_VALUES: int = int(N_COLS * SPARSITY)
# Generate "column" names (or keys in MapArray context)
subrefs: List[str] = [
f"really_really_really_long_column_name_for_a_subreference_{i}"
for i in range(N_COLS)
]
# Generate an index array for column names
all_subrefs_inds: ndarray = np.arange(N_COLS)
# Generate actual data (random indices) for each row/column combination
subvar_indexes: ndarray = np.array(
[
np.random.choice(all_subrefs_inds, size=N_COLS_W_VALUES, replace=False)
for _ in range(N_ROWS)
]
).ravel()
# Generate random values between 1 and 10 for each row/column combination
values: ndarray = np.random.randint(1, 11, size=(N_ROWS, N_COLS_W_VALUES)).ravel()
# Generate offsets for each row
offsets: ndarray = np.linspace(0, N_ROWS * N_COLS_W_VALUES, N_ROWS + 1, dtype=int)
# Create DictionaryArray for keys and MapArray for the map structure
keys = pa.DictionaryArray.from_arrays(pa.array(subvar_indexes), subrefs)
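# Assemble the map column in chunks of CHUNK_SIZE rows; each chunk reuses the
# full keys/values arrays, with its offsets slice selecting the matching window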
map_array = pa.chunked_array(
[
pa.MapArray.from_arrays(offsets[i : i + CHUNK_SIZE + 1], keys, pa.array(values))
for i in range(0, len(offsets) - 1, CHUNK_SIZE)
]
)
# Write and read the table
print("Writing table")
tbl = pa.Table.from_arrays([map_array], names=["map_array"])
pq.write_table(tbl, "test.parquet")
print("Reading table")
loaded_map_array = pq.read_table("test.parquet").column(0)
print("Successfully read the table from parquet and loaded into pyarrow.")🏷 Component(s)
- Parquet
- Python
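A possible mitigation, not part of the original report: this error typically appears when a single row group's decoded nested column would exceed Arrow's 2 GiB per-chunk limit for binary data, forcing a chunked output that the nested-type conversion does not support. If that is the cause here, writing with a smaller row_group_size so each row group decodes into a single chunk may avoid it. The sketch below assumes the tbl variable from the reproduction code; the row_group_size value is illustrative and not verified against this dataset.

import pyarrow.parquet as pq

# Hedged workaround sketch: write smaller row groups so each decoded
# column chunk stays below Arrow's per-array limits. The row_group_size
# value is an illustrative assumption.
pq.write_table(tbl, "test.parquet", row_group_size=50_000)
loaded_map_array = pq.read_table("test.parquet").column(0)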