-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-7284: [Java] ensure java implementation meets clarified dictionary spec #5945
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc @emkornfield |
|
@tianchen92 sorry, I've been backlogged will try to review your open PRs over the next week or so (unless others get to them sooner CC @praveenbingo @pravindra @BryanCutler) |
java/vector/src/test/java/org/apache/arrow/vector/ipc/TestArrowReaderWriter.java
Outdated
Show resolved
Hide resolved
java/vector/src/test/java/org/apache/arrow/vector/ipc/TestArrowReaderWriter.java
Outdated
Show resolved
Hide resolved
java/vector/src/test/java/org/apache/arrow/vector/ipc/TestArrowReaderWriter.java
Outdated
Show resolved
Hide resolved
java/vector/src/main/java/org/apache/arrow/vector/ipc/message/ArrowDictionaryBatch.java
Show resolved
Hide resolved
|
Thanks @emkornfield, PR updated according to your comments. |
| serializeDictionaryBatch(out, dictionary1, false, closeableList); | ||
|
|
||
| // write recordBatch2 | ||
| serializeRecordBatch(out, Arrays.asList(encodedVector1, encodedVector2), closeableList); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this necessary for this test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess yes, if we only sent dictionaries without batches, seems dictionaries won't be read since we made this change in https://issues.apache.org/jira/browse/ARROW-6040
… batch Related to [ARROW-7546](https://issues.apache.org/jira/browse/ARROW-7546). Per discussion #5945 (comment). In ARROW-7284, we write a simple method to concat vectors. However, ARROW-7073 is about to concat vector values efficiently, after this PR merged, we should use this new implementation in ArrowReader. Closes #6431 from tianchen92/ARROW-7546 and squashes the following commits: 9bced46 <tianchen> ARROW-7546: Use new implementation to concat vectors values in batch Authored-by: tianchen <[email protected]> Signed-off-by: Micah Kornfield <[email protected]>
… batch Related to [ARROW-7546](https://issues.apache.org/jira/browse/ARROW-7546). Per discussion apache/arrow#5945 (comment). In ARROW-7284, we write a simple method to concat vectors. However, ARROW-7073 is about to concat vector values efficiently, after this PR merged, we should use this new implementation in ArrowReader. Closes #6431 from tianchen92/ARROW-7546 and squashes the following commits: 9bced461c <tianchen> ARROW-7546: Use new implementation to concat vectors values in batch Authored-by: tianchen <[email protected]> Signed-off-by: Micah Kornfield <[email protected]>
…ary spec Related to [ARROW-7284](https://issues.apache.org/jira/browse/ARROW-7284). As discussed on [[link](https://lists.apache.org/thread.html/d0f137e9db0abfcfde2ef879ca517a710f620e5be4dd749923d22c37@%3Cdev.arrow.apache.org%3E)]. This is the java side implementation. 1. It is not required that all dictionary batches occur at the beginning of the IPC stream format (if a the first record batch has an all null dictionary encoded column, the null column's dictionary might not be sent until later in the stream). 2. A second dictionary batch for the same ID that is not a "delta batch" in an IPC stream indicates the dictionary should be replaced. 3. Clarifies that the file format, can only contain 1 "NON-delta" dictionary batch and multiple "delta" dictionary batches. 4. Add an enum to dictionary metadata for possible future changes in what format dictionary batches can be sent. (the most likely would be an array Map<Int, Value>). An enum is needed as a place holder to allow for forward compatibility past the release 1.0.0. Closes apache#5945 from tianchen92/ARROW-7284 and squashes the following commits: 16a609d <tianchen> resolve some comments 1231756 <tianchen> ARROW-7284: ensure java implementation meets clarified dictionary spec Authored-by: tianchen <[email protected]> Signed-off-by: Micah Kornfield <[email protected]>
Related to ARROW-7284.
As discussed on [link]. This is the java side implementation.
It is not required that all dictionary batches occur at the beginning
of the IPC stream format (if a the first record batch has an all null
dictionary encoded column, the null column's dictionary might not be sent
until later in the stream).
A second dictionary batch for the same ID that is not a "delta batch"
in an IPC stream indicates the dictionary should be replaced.
Clarifies that the file format, can only contain 1 "NON-delta"
dictionary batch and multiple "delta" dictionary batches.
Add an enum to dictionary metadata for possible future changes in what
format dictionary batches can be sent. (the most likely would be an array
Map<Int, Value>). An enum is needed as a place holder to allow for forward
compatibility past the release 1.0.0.