[Python] Allow constructing InMemoryDataset from RecordBatchReader #46729

@amoeba

Describe the bug, including details regarding any error messages, version, and platform.

The docstring for InMemoryDataset indicates that you can create one from a RecordBatchReader:

cdef class InMemoryDataset(Dataset):
    """
    A Dataset wrapping in-memory data.

    Parameters
    ----------
    source : RecordBatch, Table, list, tuple
        The data for this dataset. Can be a RecordBatch, Table, list of
        RecordBatch/Table, iterable of RecordBatch, or a RecordBatchReader
        If an iterable is provided, the schema must also be provided.

However, if you try this you currently get an error saying you cannot:

>>> ds = ds.InMemoryDataset(rbr)
Traceback (most recent call last):
  File "<python-input-37>", line 1, in <module>
    ds = ds.InMemoryDataset(rbr)
  File "pyarrow/_dataset.pyx", line 1038, in pyarrow._dataset.InMemoryDataset.__init__
TypeError: Expected a table, batch, or list of tables/batches instead of the given type: RecordBatchReader

I think we don't allow simple construction of an InMemoryDataset from a RecordBatchReader because that would violate the assumption Datasets make about sources being re-readable (a RecordBatchReader is one-shot). But I don't see why the InMemoryDataset constructor can't consume the RecordBatchReader and construct a Table from it.

Component(s)

Python
