Conversation

@wesm (Member) commented Mar 19, 2017

This patch enables the following code for writing record batches with lengths exceeding 2^31 - 1:

RETURN_NOT_OK(WriteLargeRecordBatch(
    batch, buffer_offset, mmap_.get(), &metadata_length, &body_length, pool_));
return ReadLargeRecordBatch(batch.schema(), 0, mmap_.get(), result);
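
For context, here is a hedged sketch of how this call site might be wired up, assuming mmap_ is a std::shared_ptr<arrow::io::MemoryMappedFile> and pool_ is an arrow::MemoryPool* as in the snippet above; the out-parameter types follow the existing WriteRecordBatch conventions, and the exact signatures and header paths of these experimental functions may differ:

#include <arrow/api.h>
#include <arrow/io/file.h>
#include <arrow/ipc/api.h>

// Round-trip a large record batch through a memory-mapped file.
// WriteLargeRecordBatch/ReadLargeRecordBatch are the experimental APIs
// added by this patch; namespace qualification is assumed here.
arrow::Status RoundTripLargeBatch(
    const arrow::RecordBatch& batch,
    const std::shared_ptr<arrow::io::MemoryMappedFile>& mmap_,
    arrow::MemoryPool* pool_,
    std::shared_ptr<arrow::RecordBatch>* result) {
  int64_t buffer_offset = 0;    // where the batch starts in the file
  int32_t metadata_length = 0;  // filled in by the writer
  int64_t body_length = 0;      // filled in by the writer
  RETURN_NOT_OK(arrow::ipc::WriteLargeRecordBatch(
      batch, buffer_offset, mmap_.get(), &metadata_length, &body_length,
      pool_));
  return arrow::ipc::ReadLargeRecordBatch(batch.schema(), 0, mmap_.get(),
                                          result);
}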

This also does a fair amount of refactoring and code consolidation as part of the ongoing code cleanup in arrow_ipc.

These APIs are marked experimental. This does add a LargeRecordBatch flatbuffer type to the Message union, but I've indicated that Arrow implementations (e.g. Java) are not required to implement it; it exists strictly to enable C++ users to write very large datasets that are embedded, for convenience, in Arrow's structured data model.

cc @pcmoritz @robertnishihara

wesm added 8 commits March 18, 2017 21:47

…s for IPC metadata and convert to flatbuffers later
… record batch read/write path
… aligned bitmaps
@wesm (Member, Author) commented Mar 19, 2017

Java builds are failing intermittently due to Maven Central flakiness.

@xhochy (Member) left a comment Mar 20, 2017

+1, LGTM

@asfgit asfgit closed this in df2220f Mar 20, 2017
@wesm wesm deleted the ARROW-661 branch March 20, 2017 13:34
@wesm (Member, Author) commented Mar 20, 2017

@xhochy thanks for merging this patch. Since this changed Message.fbs, I want to make sure that @julienledem takes a look and understands the issue.

It would be good to provide a reasonable guarantee that data stored in a RecordBatch can be read by all Arrow implementations. So the two ways to solve this issue could have been (see the sketch after this list):

  • Change lengths in FieldNode and RecordBatch from int to long
  • OR add the LargeFieldNode and LargeRecordBatch types
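
To make the trade-off concrete, here is a hedged C++ analogue of the two layouts; the real definitions are flatbuffers structs in Message.fbs, so these plain structs are only illustrative:

// Existing metadata: 32-bit lengths, readable by every implementation.
struct FieldNode {
  int32_t length;      // option 1 would widen this field to int64_t
  int32_t null_count;
};

// Option 2 (what this patch does): parallel 64-bit types that coexist
// with the old ones in the Message union.
struct LargeFieldNode {
  int64_t length;
  int64_t null_count;
};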

Since this is marked experimental, we aren't making any forward or API compatibility guarantees on this functionality.

@julienledem (Member) commented

This looks fine. If the feature becomes real, we should consider just changing the length field to long in FieldNode and specifying in the metadata that supporting length > 2^31 - 1 is optional.

@wesm (Member, Author) commented Mar 20, 2017

If others feel that would be acceptable, maintaining less code is always preferable from my perspective. I believe the requirement to store vectors with more than INT32_MAX elements on the C++/Python side is not going to go away.

The downside is that if you encounter a RecordBatch in the wild in Java, you may get an exception if it's too big. I'm not sure how concerning that is.

@julienledem (Member) commented

I'd recommend that the C++ side does not allow writing vectors with more than INT32_MAX entries by default. You'd have to explicitly enable it. This way people don't inadvertently create things that won't be cross-language compatible.
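
A minimal sketch of the kind of guard being suggested, assuming a hypothetical allow_64bit flag (the option name is illustrative, not an API from this PR):

#include <limits>

#include <arrow/api.h>

// Hypothetical pre-write check: reject over-length batches unless the
// caller has explicitly opted in to 64-bit lengths.
arrow::Status CheckBatchLength(const arrow::RecordBatch& batch,
                               bool allow_64bit) {
  if (!allow_64bit &&
      batch.num_rows() > std::numeric_limits<int32_t>::max()) {
    return arrow::Status::Invalid(
        "Batch length exceeds INT32_MAX; set allow_64bit to write data "
        "that other Arrow implementations may not be able to read");
  }
  return arrow::Status::OK();
}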

@wesm (Member, Author) commented Mar 22, 2017

I'm OK with that. I will open a JIRA about changing the RecordBatch types from int to long.

@wesm (Member, Author) commented Mar 22, 2017

see ARROW-679
