[PARQUET] Segmentation fault when writing empty RecordBatches #2951

Description

@suvayu

I have a curious problem. I am trying to convert a very sparse dataset (only ~3% of the rows in a range are populated) to parquet. The file I am working with spans up to ~63M rows. I decided to iterate in batches of 500k rows, 127 batches in total, where each batch is a RecordBatch. I create 4 batches at a time and write them to a parquet file incrementally. Something like this:

batches = [..]  # 4 batches
tbl = pa.Table.from_batches(batches)
# btw, is there a guideline on how to choose row_group_size?
# pqwriter is an already-open pyarrow.parquet.ParquetWriter
pqwriter.write_table(tbl, row_group_size=15000)
# same issue with pq.write_table(..)
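
For reference, the full loop looks roughly like the sketch below; iter_batch_groups is a hypothetical stand-in for my actual batch construction and yields lists of 4 RecordBatches at a time:

import pyarrow as pa
import pyarrow.parquet as pq

pqwriter = None
for batches in iter_batch_groups():  # hypothetical helper, yields 4 batches per call
    tbl = pa.Table.from_batches(batches)
    if pqwriter is None:
        # open the writer lazily, once the schema is known
        pqwriter = pq.ParquetWriter('output.pq', tbl.schema)
    pqwriter.write_table(tbl, row_group_size=15000)
if pqwriter is not None:
    pqwriter.close()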

I was getting a segmentation fault, which I narrowed down to a specific iteration. I noticed that iteration had empty batches; specifically, the batch sizes were [0, 0, 2876, 14423] (the check I used is sketched after the list below). The number of rows in each RecordBatch across the whole dataset is:

[14050, 16398, 14080, 14920, 15527, 14288, 15040, 14733, 15345, 15799,
15728, 15942, 14734, 15241, 15721, 15255, 14167, 14009, 13753, 14800,
14554, 14287, 15393, 14766, 16600, 15675, 14072, 13263, 12906, 14167,
14455, 15428, 15129, 16141, 15478, 16257, 14639, 14887, 14919, 15535,
13973, 14334, 13286, 15038, 15951, 17252, 15883, 19903, 16967, 16878,
15845, 12205, 8761, 0, 0, 0, 0, 0, 2876, 14423, 13557, 12723, 14330,
15452, 13551, 12723, 12396, 13531, 13539, 11512, 13175, 13941, 14634,
15515, 14239, 13856, 13873, 14154, 14822, 13543, 14653, 15328, 16171,
15101, 15055, 15194, 14058, 13706, 14747, 14650, 14694, 15397, 15122,
16055, 16635, 14153, 14665, 14781, 15462, 15426, 16150, 14632, 14532,
15139, 15324, 15279, 16075, 16394, 16834, 15391, 16320, 16504, 17248,
15913, 15341, 14754, 16637, 15695, 16642, 18143, 19481, 19072, 15742,
18807, 18789, 14258, 0, 0]
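
The check that pinpointed the bad iteration was just printing the batch sizes per iteration, roughly (iter_batch_groups as in the sketch above):

for i, batches in enumerate(iter_batch_groups()):
    # the iteration printing [0, 0, 2876, 14423] is the one that segfaults
    print(i, [b.num_rows for b in batches])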

On excluding the empty RecordBatches, the segfault goes away, but unfortunately I haven't been able to work up a minimal example; e.g. I tried the following:

import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np

batch1 = [pa.array([], type=pa.int8()),
          pa.array([], type=pa.float32())]
batch2 = [pa.array([i for i in range(10)], type=pa.int8()),
          pa.array(np.random.rand(10), type=pa.float32())]

names = ['i', 'f']
batches = [pa.RecordBatch.from_arrays(batch1, names),
           pa.RecordBatch.from_arrays(batch2, names)]

tbl = pa.Table.from_batches(batches)
pq.write_table(tbl, '/tmp/test.pq')

But this works!
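
For completeness, the workaround I am using in the real pipeline is just to drop the empty batches before building the table (batches and pqwriter as in the loop above):

# keep only non-empty batches; this makes the segfault go away
non_empty = [b for b in batches if b.num_rows > 0]
# pass the schema explicitly in case every batch in a group is empty
tbl = pa.Table.from_batches(non_empty, schema=batches[0].schema)
pqwriter.write_table(tbl, row_group_size=15000)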

I have tried attaching gdb; the backtrace when the segfault occurs is shown below (maybe it helps; this is how I realised that the empty batches could be the cause).

(gdb) bt
#0  0x00007f3e7676d670 in parquet::TypedColumnWriter<parquet::DataType<(parquet::Type::type)6> >::WriteMiniBatch(long, short const*, short const*, parquet::ByteArray const*) ()
   from /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
#1  0x00007f3e76733d1e in arrow::Status parquet::arrow::(anonymous namespace)::ArrowColumnWriter::TypedWriteBatch<parquet::DataType<(parquet::Type::type)6>, arrow::BinaryType>(arrow::Array const&, long, short const*, short const*) ()
   from /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
#2  0x00007f3e7673a3d4 in parquet::arrow::(anonymous namespace)::ArrowColumnWriter::Write(arrow::Array const&) ()
   from /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
#3  0x00007f3e7673df09 in parquet::arrow::FileWriter::Impl::WriteColumnChunk(std::shared_ptr<arrow::ChunkedArray> const&, long, long) ()
   from /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
#4  0x00007f3e7673c74d in parquet::arrow::FileWriter::WriteColumnChunk(std::shared_ptr<arrow::ChunkedArray> const&, long, long) ()
   from /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
#5  0x00007f3e7673c8d2 in parquet::arrow::FileWriter::WriteTable(arrow::Table const&, long) ()
   from /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
#6  0x00007f3e731e3a51 in __pyx_pw_7pyarrow_8_parquet_13ParquetWriter_5write_table(_object*, _object*, _object*) ()
   from /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/_parquet.cpython-36m-x86_64-linux-gnu.so

Is this sufficient info to file a bug report? If reproducing this seems difficult, I can share more details about my environment (both data and code are public).
