
Reading chunked MapArray fails for large variables (but works for smaller) #38513

@slobodan-ilic

Description

🐞 Describe the Bug

While using pyarrow to handle real-life survey data from our custom database at Crunch.io, we ran into an issue: writing the data to Parquet files works as expected, but reading it back into a table with pq.read_table raises an error. The failure depends on the data size and appears to be related to nested types.


Platform & Version Info

  • Library: PyArrow
  • Language: Python
  • Environment: macOS
  • Version: 13.0.0

⚠️ Error Message

Here is the traceback of the encountered error:

Traceback (most recent call last):
  File "/Users/slobodanilic/Documents/Crunch/repos/crunch-lake/testmap.py", line 61, in <module>
    loaded_map_array = pq.read_table("test.parquet").column(0)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/slobodanilic/Documents/Crunch/repos/crunch-lake/venv/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 3002, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/slobodanilic/Documents/Crunch/repos/crunch-lake/venv/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 2630, in read
    table = self._dataset.to_table(
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 556, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3638, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Nested data conversions not implemented for chunked array outputs
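
Our working hypothesis (an assumption on our side, not confirmed against the Arrow internals) is that the size dependence comes from the 2 GiB limit on a single int32-offset string array: once the decoded map keys for one column chunk exceed that limit, the reader has to emit a chunked array, and it is that nested-to-chunked conversion which is not implemented. A rough back-of-the-envelope check:

# Rough sizing sketch (assumption: the int32-offset limit on a single
# Arrow string array is what forces the chunked output).
N_ROWS = 200_000       # 100_000 works
N_COLS_W_VALUES = 300  # keys per row after 0.5 sparsity
KEY_BYTES = len("really_really_really_long_column_name_for_a_subreference_000")

total_key_bytes = N_ROWS * N_COLS_W_VALUES * KEY_BYTES  # ~3.7 GB at 200K rows
INT32_LIMIT = 2**31 - 1                                 # ~2.1 GB

print(total_key_bytes > INT32_LIMIT)  # True at 200K rows, False at 100K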

📄 Reproducible Code

Run the code as-is with 200K rows to reproduce the error. Reduce the row count to 100K, and it should work.

"""
Test writing/reading large map array in chunks.

This example demonstrates an issue when trying to encode real-life survey data results
into a map-array structure in pyarrow and saving it into a parquet file. Reading it back
raises an error: `Nested data conversions not implemented for chunked array outputs`.
"""

from typing import List
import numpy as np
from numpy import ndarray
import pyarrow as pa
import pyarrow.parquet as pq

# Parameters
N_ROWS: int = 200000  # changing this to 100K will make the example work
N_COLS: int = 600
SPARSITY: float = 0.5
CHUNK_SIZE: int = 10000

# Calculate sparsity-affected column size
N_COLS_W_VALUES: int = int(N_COLS * SPARSITY)

# Generate "column" names (or keys in MapArray context)
subrefs: List[str] = [
    f"really_really_really_long_column_name_for_a_subreference_{i}"
    for i in range(N_COLS)
]

# Generate an index array for column names
all_subrefs_inds: ndarray = np.arange(N_COLS)

# Generate actual data (random indices) for each row/column combination
subvar_indexes: ndarray = np.array(
    [
        np.random.choice(all_subrefs_inds, size=N_COLS_W_VALUES, replace=False)
        for _ in range(N_ROWS)
    ]
).ravel()

# Generate random values between 1 and 10 for each row/column combination
values: ndarray = np.random.randint(1, 11, size=(N_ROWS, N_COLS_W_VALUES)).ravel()

# Generate offsets for each row
offsets: ndarray = np.linspace(0, N_ROWS * N_COLS_W_VALUES, N_ROWS + 1, dtype=int)

# Create DictionaryArray for keys and MapArray for the map structure
keys = pa.DictionaryArray.from_arrays(pa.array(subvar_indexes), subrefs)
map_array = pa.chunked_array(
    [
        pa.MapArray.from_arrays(offsets[i : i + CHUNK_SIZE + 1], keys, pa.array(values))
        for i in range(0, len(offsets) - 1, CHUNK_SIZE)
    ]
)

# Write and read the table
print("Writing table")
tbl = pa.Table.from_arrays([map_array], names=["map_array"])
pq.write_table(tbl, "test.parquet")

print("Reading table")
loaded_map_array = pq.read_table("test.parquet").column(0)

print("Successfully read the table from parquet and loaded into pyarrow.")

🏷 Component(s)

  • Parquet
  • Python
