Skip to content

Conversation

@tianchen92
Copy link
Contributor

Related to ARROW-7284.
As discussed on [link]. This is the java side implementation.

  1. It is not required that all dictionary batches occur at the beginning
    of the IPC stream format (if a the first record batch has an all null
    dictionary encoded column, the null column's dictionary might not be sent
    until later in the stream).

  2. A second dictionary batch for the same ID that is not a "delta batch"
    in an IPC stream indicates the dictionary should be replaced.

  3. Clarifies that the file format, can only contain 1 "NON-delta"
    dictionary batch and multiple "delta" dictionary batches.

  4. Add an enum to dictionary metadata for possible future changes in what
    format dictionary batches can be sent. (the most likely would be an array
    Map<Int, Value>). An enum is needed as a place holder to allow for forward
    compatibility past the release 1.0.0.

@tianchen92
Copy link
Contributor Author

cc @emkornfield

@github-actions
Copy link

github-actions bot commented Dec 3, 2019

@emkornfield
Copy link
Contributor

@tianchen92 sorry, I've been backlogged will try to review your open PRs over the next week or so (unless others get to them sooner CC @praveenbingo @pravindra @BryanCutler)

@tianchen92
Copy link
Contributor Author

Thanks @emkornfield, PR updated according to your comments.

serializeDictionaryBatch(out, dictionary1, false, closeableList);

// write recordBatch2
serializeRecordBatch(out, Arrays.asList(encodedVector1, encodedVector2), closeableList);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this necessary for this test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess yes, if we only sent dictionaries without batches, seems dictionaries won't be read since we made this change in https://issues.apache.org/jira/browse/ARROW-6040

emkornfield pushed a commit that referenced this pull request Feb 18, 2020
… batch

Related to [ARROW-7546](https://issues.apache.org/jira/browse/ARROW-7546).

Per discussion #5945 (comment).

In ARROW-7284, we write a simple method to concat vectors. However, ARROW-7073 is about to concat vector values efficiently, after this PR merged, we should use this new implementation in ArrowReader.

Closes #6431 from tianchen92/ARROW-7546 and squashes the following commits:

9bced46 <tianchen> ARROW-7546:  Use new implementation to concat vectors values in batch

Authored-by: tianchen <[email protected]>
Signed-off-by: Micah Kornfield <[email protected]>
kou pushed a commit to apache/arrow-java that referenced this pull request Nov 25, 2024
… batch

Related to [ARROW-7546](https://issues.apache.org/jira/browse/ARROW-7546).

Per discussion apache/arrow#5945 (comment).

In ARROW-7284, we write a simple method to concat vectors. However, ARROW-7073 is about to concat vector values efficiently, after this PR merged, we should use this new implementation in ArrowReader.

Closes #6431 from tianchen92/ARROW-7546 and squashes the following commits:

9bced461c <tianchen> ARROW-7546:  Use new implementation to concat vectors values in batch

Authored-by: tianchen <[email protected]>
Signed-off-by: Micah Kornfield <[email protected]>
pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025
…ary spec

Related to [ARROW-7284](https://issues.apache.org/jira/browse/ARROW-7284).
As discussed on [[link](https://lists.apache.org/thread.html/d0f137e9db0abfcfde2ef879ca517a710f620e5be4dd749923d22c37@%3Cdev.arrow.apache.org%3E)]. This is the java side implementation.
1.  It is not required that all dictionary batches occur at the beginning
of the IPC stream format (if a the first record batch has an all null
dictionary encoded column, the null column's dictionary might not be sent
until later in the stream).

2.  A second dictionary batch for the same ID that is not a "delta batch"
in an IPC stream indicates the dictionary should be replaced.

3.  Clarifies that the file format, can only contain 1 "NON-delta"
dictionary batch and multiple "delta" dictionary batches.

4.  Add an enum to dictionary metadata for possible future changes in what
format dictionary batches can be sent. (the most likely would be an array
Map<Int, Value>).  An enum is needed as a place holder to allow for forward
compatibility past the release 1.0.0.

Closes apache#5945 from tianchen92/ARROW-7284 and squashes the following commits:

16a609d <tianchen> resolve some comments
1231756 <tianchen> ARROW-7284:  ensure java implementation meets clarified dictionary spec

Authored-by: tianchen <[email protected]>
Signed-off-by: Micah Kornfield <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants