ARROW-366 Java Dictionary Vector #309

elahrvivaz · 2017-01-30T15:27:33Z

I've added a dictionary type, and a partial implementation of a dictionary vector that just wraps an index vector and has a reference to a lookup vector. The spec seems to indicate that any array can be dictionary encoded, but the C++ implementation created a new type, so I went that way.
Feedback would be appreciated - I want to make sure I'm on the right path.

wesm · 2017-01-30T16:39:20Z

Dictionary is not a new type, so the changes in Message.fbs need to be reverted. I will comment in more detail about how dictionary vectors/arrays interact with the messaging system in a short while

elahrvivaz · 2017-01-30T16:54:33Z

thanks - the reason I added it was because it seemed like adding a new type to ArrowTypes.tdd was required to have a DictionaryVector, and that required a matching type in the Message.fbs, otherwise I end up with a block of code in arrow/java/vector/target/generated-sources/org/apache/arrow/vector/types/pojo/ArrowType.java that doesn't compile:

  public static class Dictionary extends ArrowType {
    public static final ArrowTypeID TYPE_TYPE = ArrowTypeID.Dictionary;
    public static final Dictionary INSTANCE = new Dictionary();

    ...

    @Override
    public int getType(FlatBufferBuilder builder) {
      org.apache.arrow.flatbuf.Dictionary.startDictionary(builder);
      return org.apache.arrow.flatbuf.Dictionary.endDictionary(builder);
    }

    ...

  }

elahrvivaz · 2017-01-30T19:01:20Z

Does it make sense to have a separate dictionary vector, or should we just add dictionary encoding to all vectors (which matches the message format)?

wesm · 2017-01-30T19:24:13Z

@elahrvivaz the design of the in-memory vector containers and the messaging format need not be so tightly coupled.

The way we decided to handle dictionary encoding in an IPC/RPC setting (e.g. the streaming or file formats) is:

Dictionary-encoded fields have their dictionary metadata field populated: https://github.com/apache/arrow/blob/master/format/Message.fbs#L156
If a field is dictionary encoded, then there will be a corresponding DictionaryBatch message: https://github.com/apache/arrow/blob/master/format/Message.fbs#L277. In the case of the streaming format, the dictionaries must come first in the stream because otherwise we cannot completely construct the schema (which contains field types plus dictionaries). In the file format, the dictionaries can be written anywhere in the file, and the block locations for the dictionaries are found in the footer: https://github.com/apache/arrow/blob/master/format/File.fbs#L31
A dictionary can be shared amongst multiple vectors without copying
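The ordering rule for the streaming format can be illustrated with a small check. This is only a sketch: `StreamOrderSketch`, `Message`, and `Kind` are hypothetical names invented for illustration and are not part of the Arrow Java library, and each record batch is simplified to refer to a single dictionary id.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model of a message stream; these types are not part of Arrow. */
final class StreamOrderSketch {
    enum Kind { DICTIONARY_BATCH, RECORD_BATCH }

    /** A message either defines a dictionary id or refers to one. */
    static final class Message {
        final Kind kind;
        final long dictionaryId;

        Message(Kind kind, long dictionaryId) {
            this.kind = kind;
            this.dictionaryId = dictionaryId;
        }
    }

    /** Returns true if every record batch refers only to a dictionary already read. */
    static boolean dictionariesComeFirst(List<Message> stream) {
        Set<Long> seen = new HashSet<>();
        for (Message message : stream) {
            if (message.kind == Kind.DICTIONARY_BATCH) {
                seen.add(message.dictionaryId);
            } else if (!seen.contains(message.dictionaryId)) {
                // A record batch arrived before the dictionary it depends on.
                return false;
            }
        }
        return true;
    }
}
```

A file reader would not need this check, since the footer records where each dictionary block lives.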

In the C++ implementation, I chose to:

Encapsulate the dictionary and dictionary metadata in the arrow::DictionaryType "synthetic" type https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h#L535 -- I say synthetic because, at least per the Arrow format and metadata, dictionary encoding is not a data type. This makes it straightforward to construct a DictionaryArray (i.e. DictionaryVector) as the composition of an integer array/vector and an instance of DictionaryType: https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L417
The DictionaryArray data structure itself is not a nested type, but a composition of the dictionary type object and an array of indices: https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L436

Being not an expert in the Arrow Java library design, I'm not sure the best way to accomplish the same goals, but if it's possible to implement DictionaryVector as a composition rather than as a nested type, then that would be ideal.

Adding dictionary-encoding details to each of the primitive vector containers would probably add too much API complexity, which would be better encapsulated in one place. This may all need to fall outside the code generation path. From a user API perspective, we need only be able to construct accessors to the indices and dictionary:

ValueVector indices = dictVector.getIndices();
ValueVector dictionary = dictVector.getDictionary();
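
As a rough illustration of the composition idea, the sketch below pairs an index array with a dictionary that several vectors can share without copying. `DictionaryEncodedSketch` is a hypothetical stand-in written with plain Java collections; it is not the `DictionaryVector` class from this patch, and the string dictionary is an assumption made to keep the example short.

```java
import java.util.List;

/** Hypothetical stand-in for a dictionary-encoded vector; not the Arrow Java API. */
final class DictionaryEncodedSketch {
    private final int[] indices;           // one dictionary position per row
    private final List<String> dictionary; // distinct values, possibly shared

    DictionaryEncodedSketch(int[] indices, List<String> dictionary) {
        this.indices = indices;
        this.dictionary = dictionary;
    }

    int[] getIndices() {
        return indices;
    }

    List<String> getDictionary() {
        return dictionary;
    }

    /** Resolves a row to its decoded value by looking up its index. */
    String get(int row) {
        return dictionary.get(indices[row]);
    }
}
```

Two instances built from the same `List` hold a reference to one dictionary, which matches the goal of sharing a dictionary amongst multiple vectors.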

In the file and streaming formats, the schema construction and deconstruction becomes more complicated because we have to read all the dictionaries to construct the schema / conversely extract all the dictionaries and write them out in the Message metadata.

@jacques-n @julienledem does this jibe with what you've been thinking? How would you recommend proceeding?

julienledem · 2017-01-30T19:45:32Z

Seconded that the metadata should not change.
I'd say the dictionary ids should be encoded using the appropriate int vector and the values with the corresponding value vector. The main thing to provide is record readers on top of this.

elahrvivaz · 2017-01-31T18:16:08Z

Thanks guys. I've reverted the changes to the metadata, and added a dictionary field to the Field pojo that maps to the flatbuf implementation. The DictionaryVector is now a synthetic type, as you suggested. For maps (structs) and lists, I overloaded most of the writer methods to allow a dictionary flag, which will get set in the Field when the schema is created (this doesn't use the synthetic type yet, but it probably should). The dictionary itself isn't passed around in this case, just the id. Does this seem more like what you were thinking? If so, I'll modify the readers to create `DictionaryVector`s when the field has a dictionary and add some unit tests.
As you said @wesm, the more complicated work is the file and streaming encoding - should that be part of this PR or a separate one?
Thanks,

wesm · 2017-01-31T19:05:59Z

@elahrvivaz this is looking better. I don't grasp the dictionary flags in the writer methods; could you explain? I suspect that in practice in Java, dictionary encoding is something that you would do after constructing a particular record batch of interest.

elahrvivaz · 2017-01-31T19:19:41Z

@wesm I was looking at the 0.1 version of VectorUnloader, which generated the schema through the vector Fields. Let me re-assess with the latest code...

wesm · 2017-01-31T19:23:31Z

Thanks. My guess is that the changes you made under codegen/ are going to be rejected

wesm

These issues are very minor. As with the C++ implementation, I think we will need to go through the paces of implementing the streaming and file formats to shake out issues with the implementation.

This patch will need some test cases exhibiting a fully formed DictionaryVector, and perhaps a RecordBatch that contains a DictionaryVector as one of its fields.

wesm · 2017-02-01T01:51:41Z

java/vector/src/main/java/org/apache/arrow/vector/complex/DictionaryVector.java

The dictionary id is an implementation detail in the messaging metadata -- I don't think we need this in this class

thanks, I wasn't sure if we would always have a dictionary available or not

wesm · 2017-02-01T01:52:41Z

java/vector/src/main/java/org/apache/arrow/vector/types/Dictionary.java

I'm not sure if the dictionary id should be stored here, @jacques-n? It's more likely that we could have a Map<Integer, Dictionary> someplace to look up dictionaries by id

Sorry, I meant to bug @julienledem for comment

@julienledem I am not sure if a dictionary id needs to be assigned at dictionary construction time. This feels like something that should be dealt with when you go to write the data to file or streaming format

As discussed above, I believe removing the id field gives us more flexibility

wesm · 2017-02-01T01:53:34Z

java/vector/src/main/java/org/apache/arrow/vector/complex/DictionaryVector.java

The spec provides for the indices being any signed integer type. I'm not sure where the most appropriate place to check for this would be

I'll add some validation here

julienledem · 2017-02-02T00:46:11Z

Your changes to Field seem fine to me.
I don't think we need a DictionaryVector that just wraps the indices vector. We would get the indices vector and work with it instead.
As a first step I'd suggest just testing generating dictionary vectors (values), and data with indices and making sure the metadata goes through (like you did with Field). After that we can make a class that dictionary encodes or dictionary decodes existing vectors.

wesm · 2017-02-02T01:19:56Z

@julienledem I don't quite understand what you wrote. We need some kind of container class that is a subclass of ValueVector or FieldVector, though, right? Otherwise, what does the object structure of a dictionary-encoded vector look like?

elahrvivaz · 2017-02-02T14:41:16Z

@julienledem So it seems like you're saying dictionary encoding is just a flag in the Field metadata, and we just need a way to set that flag appropriately, but we don't need any dictionary specific classes. That is what I was trying to accomplish with the overloaded methods in the codegen changes, although that might not have been the best way to accomplish it. That also implies that the dictionary ID is set in the field, not during encoding.

julienledem · 2017-02-02T17:44:14Z

Let me rephrase my comment:
We don't necessarily need an abstraction that makes the dictionary encoded vector look like the decoded vector. We can turn a regular value vector into a dictionary encoded vector along with a dictionary (either precomputed or not). We can also convert a dictionary vector into a decoded value vector. I think we'd rather have those operations since the goal of vectorized execution is to peel away abstraction layers.
To test the dictionary vector you can:

create a regular value vector
have a dictionary encoder that produces a dictionary vector
have a dictionary decoder that produces a decoded value vector
check that the result is the same as the original
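
The four steps above can be sketched as an encode/decode round trip. This is a minimal illustration using plain Java collections rather than Arrow vectors: `DictionaryRoundTrip`, `Encoded`, `encode`, and `decode` are hypothetical names, and building the dictionary in order of first appearance is an assumption, not something the thread specifies.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical encoder/decoder pair; not the methods added in this patch. */
final class DictionaryRoundTrip {
    /** Result of encoding: indices plus the dictionary they point into. */
    static final class Encoded {
        final int[] indices;
        final List<String> dictionary;

        Encoded(int[] indices, List<String> dictionary) {
            this.indices = indices;
            this.dictionary = dictionary;
        }
    }

    /** Builds a new dictionary in order of first appearance and maps each value to its position. */
    static Encoded encode(List<String> values) {
        Map<String, Integer> positions = new HashMap<>();
        List<String> dictionary = new ArrayList<>();
        int[] indices = new int[values.size()];
        for (int i = 0; i < values.size(); i++) {
            String value = values.get(i);
            Integer position = positions.get(value);
            if (position == null) {
                position = dictionary.size();
                positions.put(value, position);
                dictionary.add(value);
            }
            indices[i] = position;
        }
        return new Encoded(indices, dictionary);
    }

    /** Replaces each index with the dictionary value it refers to. */
    static List<String> decode(Encoded encoded) {
        List<String> values = new ArrayList<>(encoded.indices.length);
        for (int index : encoded.indices) {
            values.add(encoded.dictionary.get(index));
        }
        return values;
    }
}
```

A unit test along these lines would encode a list such as "foo", "bar", "foo", decode the result, and assert that it equals the original list.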

elahrvivaz · 2017-02-02T17:49:14Z

Ok, that makes sense. I assumed the dictionary mapping would be handled by the user, not the library.

elahrvivaz · 2017-02-02T20:28:45Z

@julienledem @wesm Do you guys have any thoughts on dictionary encoding for map/struct vectors? I added some methods for encoding simple vectors, but this seems to break down with complex vectors. Seems like one should be able to e.g. have some dictionary encoded fields in a map vector, and other regular fields. I don't see any public methods to create a map vector from existing children - would it be ok to add that, or to add some dictionary encoding code to AbstractMapVector?

wesm · 2017-02-03T16:22:31Z

java/vector/src/main/java/org/apache/arrow/vector/types/Dictionary.java

@julienledem I am not sure if a dictionary id needs to be assigned at dictionary construction time. This feels like something that should be dealt with when you go to write the data to file or streaming format

wesm · 2017-02-03T16:22:50Z

java/vector/src/main/java/org/apache/arrow/vector/complex/DictionaryVector.java

Prefer signed int32 for the indices

wesm · 2017-02-03T16:31:05Z

java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java

The dictionary id is what goes in the IPC metadata, but how will this work in a strictly in-memory setting? In normal use, when you create a DictionaryVector field, the id will be null here. Then in the file/stream adapter code, you will need to either use the dictionary id here or come up with a new one.

At least in C++, I have planned for the dictionary id to be an encapsulated detail of the stream/file formats, but is otherwise not part of the public API. So I guess what I am asking is whether the dictionary id might have any use in Java outside of files and streams.

I may also be a bit confused -- if these Field objects are only used for record batch disassembly and reassembly, then these changes are fine

I'm not sure that the dictionary id has any use outside the stream/file formats. I was hesitant to mess with the stream/file encoding, so it seemed easier to add it here. I'm not sure of the repercussions either way...

With fresh eyes, I think it's OK to leave this here for now. We'll have to resolve some implementation questions in the course of handling dictionaries in the file/stream code

wesm · 2017-02-03T16:32:16Z

@elahrvivaz It would be nice to be able to hash and dictionary encode complex types, but that is a good sized project that we should tackle in a separate patch

julienledem · 2017-02-04T02:16:16Z

@elahrvivaz agreed. Simple types only is a great start

elahrvivaz · 2017-02-06T16:22:12Z

@wesm @julienledem I've added the dictionary encode/decode methods, and unit tests for varchars (seems like the primary use case). Is there anything else that needs to be addressed here? I'd like to work on file encoding next, but will start a new PR for that.
Thanks,

wesm · 2017-02-07T14:46:40Z

I'm sorry about the broken builds. If you rebase, you should get a passing Travis CI build.

I will review this again today — I will do my best to keep pace with you on implementing the Stream/File integration for dictionaries so we can have working integration tests at the end.

@julienledem can you also take a look so we can get this merged soon? Thanks!

elahrvivaz · 2017-02-07T14:59:24Z

@wesm thanks a lot! you've been very helpful and responsive.

wesm · 2017-02-07T15:09:03Z

@elahrvivaz I recommend avoiding the git merge command if at all possible -- in our PR merge tool we squash away the merge commits, but they can cause problems in other projects. In the future, it will be better to use git rebase (possibly with git rebase -i first if you need to combine commits) when updating patches with the latest from master

* Static encode methods to convert a vector into a dictionary encoded vector
* Encoding can use a provided dictionary or create a new one
* Not implemented for complex vectors or byte array vectors
* Added dictionary ID attribute to Field
* Updated Text equals and hashcode to allow dictionary lookups

elahrvivaz · 2017-02-07T15:20:11Z

thanks, rebased and squashed to a single commit

wesm

This gets a +1 from me with the id removed from the Dictionary class and the encode method. I think the other design questions will be harder to resolve until proceeding to the file/stream implementation stage

wesm · 2017-02-07T16:22:56Z

java/vector/src/main/java/org/apache/arrow/vector/complex/DictionaryVector.java

+   * @param dictionaryId the id to use for the newly created dictionary
+   * @return dictionary encoded vector
+   */
+  public static DictionaryVector encode(ValueVector vector, long dictionaryId) {


I would prefer to nix the dictionaryId here and in the Dictionary class, and instead make that an implementation detail that is handled in the stream/file code. The reason is that data may come from many different places, and dictionary ids may conflict if we do not know about all of them a priori. Whereas in the messaging / IPC code, we have the opportunity to assign unambiguous ids to the dictionaries

ok, we'll just have to update the Field id during encoding then. I'll take it out here and leave that for the encoding work.

removed in both places
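
One way to picture assigning ids in the stream/file code rather than at dictionary construction time is sketched below. `DictionaryIdAssigner` is a hypothetical helper, not part of this patch or of Arrow; it assumes that ids are handed out sequentially and that dictionaries are distinguished by object identity.

```java
import java.util.IdentityHashMap;
import java.util.Map;

/** Hypothetical write-time id allocator; not part of the Arrow Java library. */
final class DictionaryIdAssigner {
    // Keyed by identity so that two equal-looking dictionaries from different
    // sources still receive distinct ids.
    private final Map<Object, Long> ids = new IdentityHashMap<>();
    private long nextId = 0;

    /** Returns the id for this dictionary instance, allocating one on first sight. */
    long idFor(Object dictionary) {
        Long id = ids.get(dictionary);
        if (id == null) {
            id = nextId++;
            ids.put(dictionary, id);
        }
        return id;
    }
}
```

Because the writer sees every dictionary it is about to emit, ids allocated this way cannot collide, whereas ids chosen independently by each data source might.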

wesm · 2017-02-07T16:23:45Z

java/vector/src/main/java/org/apache/arrow/vector/types/Dictionary.java

As discussed above, I believe removing the id field gives us more flexibility

wesm · 2017-02-07T16:24:51Z

java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java

With fresh eyes, I think it's OK to leave this here for now. We'll have to resolve some implementation questions in the course of handling dictionaries in the file/stream code

wesm

+1. thank you for the patient labor! Your PR title is missing a : after ARROW-366, let me see if I can appease the PR tool and merge this now

elahrvivaz · 2017-02-07T21:46:30Z

great, thanks!

This supersedes apache#309 and incorporates the `std::shared_ptr<const KeyValueMetadata>` pattern so less copying is needed in Parquet for metadata inbound from Arrow (and vice versa). close apache#309 Author: Wes McKinney <[email protected]> Author: Phillip Cloud <[email protected]> Closes apache#314 from wesm/PARQUET-595 and squashes the following commits: c0199c5 [Wes McKinney] Remove some more std::string includes 3d3be4e [Wes McKinney] Remove string include b2ed09e [Wes McKinney] Add backwards compatible schema APIs 116575a [Wes McKinney] Use std::shared_ptr<const KeyValueMetadata> from upstream Arrow 5116eaa [Phillip Cloud] Add support for reading/writing Schema-level Arrow metadata Change-Id: I80d6443efcd89c52b09a357e7b1d9eeabdff79b8

This supersedes apache#309 and incorporates the `std::shared_ptr<const KeyValueMetadata>` pattern so less copying is needed in Parquet for metadata inbound from Arrow (and vice versa). close apache#309 Author: Wes McKinney <[email protected]> Author: Phillip Cloud <[email protected]> Closes apache#314 from wesm/PARQUET-595 and squashes the following commits: c0199c5 [Wes McKinney] Remove some more std::string includes 3d3be4e [Wes McKinney] Remove string include b2ed09e [Wes McKinney] Add backwards compatible schema APIs 116575a [Wes McKinney] Use std::shared_ptr<const KeyValueMetadata> from upstream Arrow 5116eaa [Phillip Cloud] Add support for reading/writing Schema-level Arrow metadata Change-Id: Ib46a73ac77cc952b032f0f93ee3297808b9f959e

I've added a dictionary type, and a partial implementation of a dictionary vector that just wraps an index vector and has a reference to a lookup vector. The spec seems to indicate that any array can be dictionary encoded, but the C++ implementation created a new type, so I went that way. Feedback would be appreciated - I want to make sure I'm on the right path. Author: Emilio Lahr-Vivaz <[email protected]> Closes apache#309 from elahrvivaz/ARROW-366 and squashes the following commits: 60836ea [Emilio Lahr-Vivaz] removing dictionary ID from encoded vector 0871e13 [Emilio Lahr-Vivaz] ARROW-366 Adding Java dictionary vector

wesm reviewed Feb 1, 2017

wesm reviewed Feb 3, 2017

elahrvivaz changed the title ~~WIP: ARROW-366 Java Dictionary Vector~~ ARROW-366 Java Dictionary Vector Feb 6, 2017

elahrvivaz force-pushed the ARROW-366 branch from 9b0bafc to 0871e13 Compare February 7, 2017 15:20

wesm reviewed Feb 7, 2017

removing dictionary ID from encoded vector

60836ea

wesm approved these changes Feb 7, 2017

asfgit closed this in c322cbf Feb 7, 2017

elahrvivaz deleted the ARROW-366 branch February 10, 2017 16:14

elahrvivaz mentioned this pull request Oct 18, 2018

ARROW-3396: [WIP] [Java] VectorSchemaRoot.create(schema, allocator) does not create the expected vector type for dictionary-encoded fields #2681

Closed
