Conversation

@wesm (Member) commented Mar 19, 2017

This patch enables the following code for writing record batches with lengths exceeding 2^31 - 1:

RETURN_NOT_OK(WriteLargeRecordBatch(
    batch, buffer_offset, mmap_.get(), &metadata_length, &body_length, pool_));
return ReadLargeRecordBatch(batch.schema(), 0, mmap_.get(), result);
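
For context, here is a hedged sketch of how this call site might be wired up, assuming mmap_ is a std::shared_ptr<arrow::io::MemoryMappedFile> and pool_ is an arrow::MemoryPool* as in the snippet above; the out-parameter types follow the existing WriteRecordBatch conventions, and the exact signatures and header paths of these experimental functions may differ:

#include <arrow/api.h>
#include <arrow/io/file.h>
#include <arrow/ipc/api.h>

// Round-trip a large record batch through a memory-mapped file.
// WriteLargeRecordBatch/ReadLargeRecordBatch are the experimental APIs
// added by this patch; namespace qualification is assumed here.
arrow::Status RoundTripLargeBatch(
    const arrow::RecordBatch& batch,
    const std::shared_ptr<arrow::io::MemoryMappedFile>& mmap_,
    arrow::MemoryPool* pool_,
    std::shared_ptr<arrow::RecordBatch>* result) {
  int64_t buffer_offset = 0;    // where the batch starts in the file
  int32_t metadata_length = 0;  // filled in by the writer
  int64_t body_length = 0;      // filled in by the writer
  RETURN_NOT_OK(arrow::ipc::WriteLargeRecordBatch(
      batch, buffer_offset, mmap_.get(), &metadata_length, &body_length,
      pool_));
  return arrow::ipc::ReadLargeRecordBatch(batch.schema(), 0, mmap_.get(),
                                          result);
}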

This also does a fair amount of refactoring and code consolidation as part of the ongoing code cleanup in arrow_ipc.

These APIs are marked experimental. This does add a LargeRecordBatch flatbuffer type to the Message union, but I've indicated that Arrow implementations (e.g. Java) are not required to implement it; it exists strictly to enable C++ users to write very large datasets that are embedded, for convenience, in Arrow's structured data model.

cc @pcmoritz @robertnishihara

wesm added 8 commits March 18, 2017 21:47

…s for IPC metadata and convert to flatbuffers later
… record batch read/write path
… aligned bitmaps
@wesm (Member, Author) commented Mar 19, 2017

Java builds are failing intermittently due to Maven Central flakiness.

@xhochy (Member) left a comment Mar 20, 2017

+1, LGTM

@asfgit asfgit closed this in df2220f Mar 20, 2017
@wesm wesm deleted the ARROW-661 branch March 20, 2017 13:34
@wesm (Member, Author) commented Mar 20, 2017

@xhochy thanks for merging this patch. Since this changed Message.fbs, I want to make sure that @julienledem takes a look and understands the issue.

It would be good to provide a reasonable guarantee that data stored in a RecordBatch can be read by all Arrow implementations. So the two ways to solve this issue could have been (see the sketch after this list):

  • Change lengths in FieldNode and RecordBatch from int to long
  • OR add the LargeFieldNode and LargeRecordBatch types
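
To make the trade-off concrete, here is a hedged C++ analogue of the two layouts; the real definitions are flatbuffers structs in Message.fbs, so these plain structs are only illustrative:

// Existing metadata: 32-bit lengths, readable by every implementation.
struct FieldNode {
  int32_t length;      // option 1 would widen this field to int64_t
  int32_t null_count;
};

// Option 2 (what this patch does): parallel 64-bit types that coexist
// with the old ones in the Message union.
struct LargeFieldNode {
  int64_t length;
  int64_t null_count;
};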

Since this is marked experimental, we aren't making any forward or API compatibility guarantees on this functionality.

@julienledem (Member) commented

This looks fine. If the feature becomes real, we should consider just changing the length field to long in FieldNode and specifying in the metadata that supporting length > 2^31 - 1 is optional.

@wesm (Member, Author) commented Mar 20, 2017

If others feel that would be acceptable, maintaining less code is always preferable from my perspective. I believe the requirement to store vectors with more than INT32_MAX elements on the C++/Python side is not going to go away.

The downside is that if you encounter a RecordBatch in the wild in Java, you may get an exception if it's too big. I'm not sure how concerning that is.

@julienledem (Member) commented

I'd recommend that the C++ side does not allow writing vectors with more than INT32_MAX entries by default. You'd have to explicitly enable it. This way people don't inadvertently create things that won't be cross-language compatible.
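
A minimal sketch of the kind of guard being suggested, assuming a hypothetical allow_64bit flag (the option name is illustrative, not an API from this PR):

#include <limits>

#include <arrow/api.h>

// Hypothetical pre-write check: reject over-length batches unless the
// caller has explicitly opted in to 64-bit lengths.
arrow::Status CheckBatchLength(const arrow::RecordBatch& batch,
                               bool allow_64bit) {
  if (!allow_64bit &&
      batch.num_rows() > std::numeric_limits<int32_t>::max()) {
    return arrow::Status::Invalid(
        "Batch length exceeds INT32_MAX; set allow_64bit to write data "
        "that other Arrow implementations may not be able to read");
  }
  return arrow::Status::OK();
}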

@wesm (Member, Author) commented Mar 22, 2017

I'm OK with that. I will open a JIRA about changing the RecordBatch types from int to long.

@wesm (Member, Author) commented Mar 22, 2017

see ARROW-679
