-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-366 Java Dictionary Vector #309
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Dictionary is not a new type, so the changes in Message.fbs need to be reverted. I will comment in more detail about how dictionary vectors/arrays interact with the messaging system in a short while |
|
thanks - the reason I added it was because it seemed like adding a new type to |
|
Does it make sense to have a separate dictionary vector? or just add dictionary encoding to all vectors (which matches the message format)? |
|
@elahrvivaz the design of the in-memory vector containers and the messaging format need not be so tightly coupled. The way we decided to handle dictionary encoding in an IPC/RPC setting (e.g. the streaming or file formats) is:
In the C++ implementation, I chose to:
Being not an expert in the Arrow Java library design, I'm not sure the best way to accomplish the same goals, but if it's possible to implement DictionaryVector as a composition rather than as a nested type, then that would be ideal. Adding dictionary-encoding details to each of the primitive vector containers would probably add too much API complexity that would be better encapsulated in one place. This probably may need to all fall outside the code generation path. From a user API perspective, we need only be able to construct accessors to the indices and dictionary: ValueVector indices = dictVector.getIndices();
ValueVector dictionary = dictVector.getDictionary();In the file and streaming formats, the schema construction and deconstruction becomes more complicated because we have to read all the dictionaries to construct the schema / conversely extract all the dictionaries and write them out in the Message metadata. @jacques-n @julienledem does this jive with what you've been thinking? how would you recommend proceeding? |
|
seconded that the metadata should not change. |
|
Thanks guys. I've reverted the changes to the metadata, and added a |
|
@elahrvivaz this is looking better. I don't grasp the |
|
@wesm I was looking at the 0.1 version of |
|
Thanks. My guess is that the changes you made under |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thes issues are very minor. As with the C++ implementation, I think we will need to go through the paces of implementing the streaming and file formats to shake out issues with the implementation.
This patch will need some tests cases exhibiting a fully formed DictionaryVector, and perhaps a RecordBatch that contains a DictionaryVector as one of its fields
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The dictionary id is an implementation detail in the messaging metadata -- I don't think we need this in this class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, I wasn't sure if we would always have a dictionary available or not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if the dictionary id should be stored here, @jacques-n? It's more likely that we could have a Map<int, Dictionary> someplace to look up dictionaries by id
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry i meant to bug @julienledem for comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@julienledem I am not sure if a dictionary id needs to be assigned at dictionary construction time. This feels like something that should be dealt with when you go to write the data to file or streaming format
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed above, I believe removing the id field gives us more flexibility
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The spec provides for the indices being any signed integer type. I'm not sure where will be the most appropriate place to check for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add some validation here
|
Your changes to Field seem fine to me. |
|
@julienledem I don't quite understand what you wrote. We need some kind of container class that is a subclass of |
|
@julienledem So it seems like you're saying dictionary encoding is just a flag in the Field metadata, and we just need a way to set that flag appropriately, but we don't need any dictionary specific classes. That is what I was trying to accomplish with the overloaded methods in the codegen changes, although that might not have been the best way to accomplish it. That also implies that the dictionary ID is set in the field, not during encoding. |
|
Let me rephrase my comment:
|
|
Ok, that makes sense. I assumed the dictionary mapping would be handled by the user, not the library. |
|
@julienledem @wesm Do you guys have any thoughts on dictionary encoding for map/struct vectors? I added some methods for encoding simple vectors, but this seems to break down with complex vectors. Seems like one should be able to e.g. have some dictionary encoded fields in a map vector, and other regular fields. I don't see any public methods to create a map vector from existing children - would it be ok to add that, or to add some dictionary encoding code to AbstractMapVector? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@julienledem I am not sure if a dictionary id needs to be assigned at dictionary construction time. This feels like something that should be dealt with when you go to write the data to file or streaming format
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prefer signed int32 for the indices
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will fix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The dictionary id is what goes in the IPC metadata, but how will this work in a strictly in-memory setting? In normal use, when you create a DictionaryVector field, the id will be null here. Then in the file/stream adapter code, you will need to either use the dictionary id here or come up with a new one.
At least in C++, I have planned for the dictionary id to be an encapsulated detail of the stream/file formats, but is otherwise not part of the public API. So I guess what I am asking is whether the dictionary id might have any use in Java outside of files and streams.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may also be a bit confused -- if these Field objects are only used for record batch disassembly and reassembly, then these changes are fine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that the dictionary id has any use outside the stream/file formats. I was hesitant to mess with the stream/file encoding, so it seemed easier to add it here. I'm not sure of the repercussions either way...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With fresh eyes, I think it's OK to leave this here for now. We'll have to resolve some implementation questions in the course of handling dictionaries in the file/stream code
|
@elahrvivaz It would be nice to be able to hash and dictionary encode complex types, but that is a good sized project that we should tackle in a separate patch |
|
@elahrvivaz agreed. Simple types only is a great start |
|
@wesm @julienledem I've added the dictionary encode/decode methods, and unit tests for varchars (seems like the primary use case). Is there anything else that needs to be addressed here? I'd like to work on file encoding next, but will start a new PR for that. |
|
I'm sorry about the broken builds. If you rebase, you should get a passing Travis CI build. I will review this again today — I will do my best to keep pace with you on implementing the Stream/File integration for dictionaries so we can have working integration tests at the end. @julienledem can you also take a look so we can get this merged soon? Thanks! |
|
@wesm thanks a lot! you've been very helpful and responsive. |
|
@elahrvivaz I recommend avoiding the |
* Static encode methods to convert a vector into a dictionary encoded vector * Encoding can use a provided dictionary or create a new one * Not implemented for complex vectors or byte array vectors * Added dictionary ID attribute to Field * Updated Text equals and hashcode to allow dictionary lookups
|
thanks, rebased and squashed to a single commit |
wesm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This gets a +1 from me with the id removed from the Dictionary class and the encode method. I think the other design questions will be harder to resolve until proceeding to the file/stream implementation stage
| * @param dictionaryId the id to use for the newly created dictionary | ||
| * @return dictionary encoded vector | ||
| */ | ||
| public static DictionaryVector encode(ValueVector vector, long dictionaryId) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to nix the dictionaryId here and in the Dictionary class, and instead make that an implementation detail that is handled in the stream/file code. The reason is that data may come from many different places, and dictionary ids may conflict if we do not know about all of them a priori. Whereas in the messaging / IPC code, we have the opportunity to assign unambiguous ids to the dictionaries
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, we'll just have to update the Field id during encoding then. I'll take it out here and leave that for the encoding work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed in both places
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed above, I believe removing the id field gives us more flexibility
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With fresh eyes, I think it's OK to leave this here for now. We'll have to resolve some implementation questions in the course of handling dictionaries in the file/stream code
wesm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. thank you for the patient labor! Your PR title is missing a : after ARROW-366, let me see if I can appease the PR tool and merge this now
|
great, thanks! |
This supersedes apache#309 and incorporates the `std::shared_ptr<const KeyValueMetadata>` pattern so less copying is needed in Parquet for metadata inbound from Arrow (and vice versa). close apache#309 Author: Wes McKinney <[email protected]> Author: Phillip Cloud <[email protected]> Closes apache#314 from wesm/PARQUET-595 and squashes the following commits: c0199c5 [Wes McKinney] Remove some more std::string includes 3d3be4e [Wes McKinney] Remove string include b2ed09e [Wes McKinney] Add backwards compatible schema APIs 116575a [Wes McKinney] Use std::shared_ptr<const KeyValueMetadata> from upstream Arrow 5116eaa [Phillip Cloud] Add support for reading/writing Schema-level Arrow metadata Change-Id: I80d6443efcd89c52b09a357e7b1d9eeabdff79b8
This supersedes apache#309 and incorporates the `std::shared_ptr<const KeyValueMetadata>` pattern so less copying is needed in Parquet for metadata inbound from Arrow (and vice versa). close apache#309 Author: Wes McKinney <[email protected]> Author: Phillip Cloud <[email protected]> Closes apache#314 from wesm/PARQUET-595 and squashes the following commits: c0199c5 [Wes McKinney] Remove some more std::string includes 3d3be4e [Wes McKinney] Remove string include b2ed09e [Wes McKinney] Add backwards compatible schema APIs 116575a [Wes McKinney] Use std::shared_ptr<const KeyValueMetadata> from upstream Arrow 5116eaa [Phillip Cloud] Add support for reading/writing Schema-level Arrow metadata Change-Id: Ib46a73ac77cc952b032f0f93ee3297808b9f959e
This supersedes #309 and incorporates the `std::shared_ptr<const KeyValueMetadata>` pattern so less copying is needed in Parquet for metadata inbound from Arrow (and vice versa). close #309 Author: Wes McKinney <[email protected]> Author: Phillip Cloud <[email protected]> Closes #314 from wesm/PARQUET-595 and squashes the following commits: c0199c5 [Wes McKinney] Remove some more std::string includes 3d3be4e [Wes McKinney] Remove string include b2ed09e [Wes McKinney] Add backwards compatible schema APIs 116575a [Wes McKinney] Use std::shared_ptr<const KeyValueMetadata> from upstream Arrow 5116eaa [Phillip Cloud] Add support for reading/writing Schema-level Arrow metadata Change-Id: Ib46a73ac77cc952b032f0f93ee3297808b9f959e
I've added a dictionary type, and a partial implementation of a dictionary vector that just wraps an index vector and has a reference to a lookup vector. The spec seems to indicate that any array can be dictionary encoded, but the C++ implementation created a new type, so I went that way. Feedback would be appreciated - I want to make sure I'm on the right path. Author: Emilio Lahr-Vivaz <[email protected]> Closes apache#309 from elahrvivaz/ARROW-366 and squashes the following commits: 60836ea [Emilio Lahr-Vivaz] removing dictionary ID from encoded vector 0871e13 [Emilio Lahr-Vivaz] ARROW-366 Adding Java dictionary vector
I've added a dictionary type, and a partial implementation of a dictionary vector that just wraps an index vector and has a reference to a lookup vector. The spec seems to indicate that any array can be dictionary encoded, but the C++ implementation created a new type, so I went that way.
Feedback would be appreciated - I want to make sure I'm on the right path.