@elahrvivaz
Contributor

I've added a dictionary type, and a partial implementation of a dictionary vector that just wraps an index vector and has a reference to a lookup vector. The spec seems to indicate that any array can be dictionary encoded, but the C++ implementation created a new type, so I went that way.
Feedback would be appreciated - I want to make sure I'm on the right path.

@wesm
Member

wesm commented Jan 30, 2017

Dictionary is not a new type, so the changes in Message.fbs need to be reverted. I will comment in more detail shortly about how dictionary vectors/arrays interact with the messaging system.

@elahrvivaz
Contributor Author

Thanks - the reason I added it was that adding a new type to ArrowTypes.tdd seemed to be required in order to have a DictionaryVector, and that in turn required a matching type in Message.fbs. Otherwise I end up with a block of code in arrow/java/vector/target/generated-sources/org/apache/arrow/vector/types/pojo/ArrowType.java that doesn't compile:

  public static class Dictionary extends ArrowType {
    public static final ArrowTypeID TYPE_TYPE = ArrowTypeID.Dictionary;
    public static final Dictionary INSTANCE = new Dictionary();

    ...

    @Override
    public int getType(FlatBufferBuilder builder) {
      org.apache.arrow.flatbuf.Dictionary.startDictionary(builder);
      return org.apache.arrow.flatbuf.Dictionary.endDictionary(builder);
    }

    ...

  }

@elahrvivaz
Contributor Author

Does it make sense to have a separate dictionary vector, or should we just add dictionary encoding to all vectors (which matches the message format)?

@wesm
Member

wesm commented Jan 30, 2017

@elahrvivaz the design of the in-memory vector containers and the messaging format need not be so tightly coupled.

The way we decided to handle dictionary encoding in an IPC/RPC setting (e.g. the streaming or file formats) is:

In the C++ implementation, I chose to:

Not being an expert in the Arrow Java library design, I'm not sure of the best way to accomplish the same goals, but if it's possible to implement DictionaryVector as a composition rather than as a nested type, then that would be ideal.

Adding dictionary-encoding details to each of the primitive vector containers would probably add too much API complexity, which would be better encapsulated in one place. This may all need to fall outside the code generation path. From a user API perspective, we need only be able to construct accessors to the indices and dictionary:

ValueVector indices = dictVector.getIndices();
ValueVector dictionary = dictVector.getDictionary();
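
For illustration only, here is a minimal sketch of what "composition" could look like: a container that holds the indices and a reference to the dictionary and exposes both through accessors. The class name `DictionaryVectorSketch` is hypothetical, and plain `int[]` and `List<String>` are used as stand-ins for Arrow's `ValueVector`, so this is not the Arrow API itself.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a dictionary-encoded vector built by composition:
// it owns an index array and holds a reference to a dictionary of values.
// Plain int[] and List<String> replace Arrow's ValueVector here for brevity.
final class DictionaryVectorSketch {
    private final int[] indices;            // one signed index per row
    private final List<String> dictionary;  // distinct values, addressed by index

    DictionaryVectorSketch(int[] indices, List<String> dictionary) {
        // Reject indices that do not point at a dictionary entry.
        for (int index : indices) {
            if (index < 0 || index >= dictionary.size()) {
                throw new IllegalArgumentException("index out of dictionary range: " + index);
            }
        }
        this.indices = indices.clone();
        this.dictionary = Collections.unmodifiableList(new ArrayList<>(dictionary));
    }

    // Accessor for the per-row indices (a defensive copy).
    int[] getIndices() {
        return indices.clone();
    }

    // Accessor for the dictionary values (read-only view).
    List<String> getDictionary() {
        return dictionary;
    }

    // Number of rows in the encoded vector.
    int getValueCount() {
        return indices.length;
    }
}
```

Nothing dictionary-specific leaks into the value containers in this arrangement; the encoded vector is just two ordinary containers held side by side.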

In the file and streaming formats, the schema construction and deconstruction becomes more complicated because we have to read all the dictionaries to construct the schema / conversely extract all the dictionaries and write them out in the Message metadata.

@jacques-n @julienledem does this jibe with what you've been thinking? How would you recommend proceeding?

@julienledem
Member

julienledem commented Jan 30, 2017

Seconded that the metadata should not change.
I'd say the dictionary ids should be encoded using the appropriate int vector and the values with the corresponding value vector. The main thing to provide is record readers on top of this.

@elahrvivaz
Contributor Author

Thanks guys. I've reverted the changes to the metadata, and added a dictionary field to the Field pojo that maps to the flatbuf implementation. The DictionaryVector is now a synthetic type, as you suggested. For maps (structs) and lists, I overloaded most of the writer methods to allow a dictionary flag, which will get set in the Field when the schema is created (this doesn't use the synthetic type yet, but it probably should). The dictionary itself isn't passed around in this case, just the id. Does this seem more like what you were thinking? If so, I'll modify the readers to create `DictionaryVector`s when the field has a dictionary and add some unit tests.
As you said @wesm, the more complicated work is the file and streaming encoding - should that be part of this PR or a separate one?
Thanks,

@wesm
Member

wesm commented Jan 31, 2017

@elahrvivaz this is looking better. I don't grasp the dictionary flags in the writer methods, could you explain? I suspect that in practice in Java, dictionary encoding is something that you would do after constructing a particular record batch of interest.

@elahrvivaz
Contributor Author

@wesm I was looking at the 0.1 version of VectorUnloader, which generated the schema through the vector Fields. Let me re-assess with the latest code...

@wesm
Member

wesm commented Jan 31, 2017

Thanks. My guess is that the changes you made under codegen/ are going to be rejected.

Member

@wesm wesm left a comment

These issues are very minor. As with the C++ implementation, I think we will need to go through the paces of implementing the streaming and file formats to shake out issues with the implementation.

This patch will need some test cases exhibiting a fully formed DictionaryVector, and perhaps a RecordBatch that contains a DictionaryVector as one of its fields.

Member

The dictionary id is an implementation detail in the messaging metadata -- I don't think we need this in this class

Contributor Author

thanks, I wasn't sure if we would always have a dictionary available or not

Member

I'm not sure if the dictionary id should be stored here, @jacques-n? It's more likely that we could have a Map<int, Dictionary> someplace to look up dictionaries by id

Member

Sorry, I meant to bug @julienledem for comment.

Member

@julienledem I am not sure if a dictionary id needs to be assigned at dictionary construction time. This feels like something that should be dealt with when you go to write the data to file or streaming format

Member

As discussed above, I believe removing the id field gives us more flexibility

Member

The spec provides for the indices being any signed integer type. I'm not sure where the most appropriate place to check for this would be.

Contributor Author

I'll add some validation here

@julienledem
Member

Your changes to Field seem fine to me.
I don't think we need a DictionaryVector that just wraps the indices vector. We would get the indices vector and work with it instead.
As a first step I'd suggest just testing generating dictionary vectors (values), and data with indices and making sure the metadata goes through (like you did with Field). After that we can make a class that dictionary encodes or dictionary decodes existing vectors.

@wesm
Member

wesm commented Feb 2, 2017

@julienledem I don't quite understand what you wrote. We need some kind of container class that is a subclass of ValueVector or FieldVector, though, right? Otherwise, what does the object structure of a dictionary-encoded vector look like?

@elahrvivaz
Contributor Author

@julienledem So it seems like you're saying dictionary encoding is just a flag in the Field metadata, and we just need a way to set that flag appropriately, but we don't need any dictionary specific classes. That is what I was trying to accomplish with the overloaded methods in the codegen changes, although that might not have been the best way to accomplish it. That also implies that the dictionary ID is set in the field, not during encoding.

@julienledem
Member

Let me rephrase my comment:
We don't necessarily need an abstraction that makes the dictionary encoded vector look like the decoded vector. We can turn a regular value vector into a dictionary encoded vector along with a dictionary (either precomputed or not). We can also convert a dictionary vector into a decoded value vector. I think we'd rather have those operations, since the goal of vectorized execution is to peel away abstraction layers.
To test the dictionary vector you can:

  • create a regular value vector
  • have a dictionary encoder that produces a dictionary vector
  • have a dictionary decoder that produces a decoded value vector
  • check that the result is the same as the original
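
The four steps above can be sketched roughly as follows. This is not the Arrow implementation: `DictionaryCodecSketch`, `Encoded`, `encode` and `decode` are hypothetical names, and a `List<String>` stands in for a varchar value vector. The dictionary is built in first-seen order, which is one possible choice among several.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DictionaryCodecSketch {
    /** Result of encoding: per-row indices plus the dictionary they point into. */
    static final class Encoded {
        final int[] indices;
        final List<String> dictionary;

        Encoded(int[] indices, List<String> dictionary) {
            this.indices = indices;
            this.dictionary = dictionary;
        }
    }

    /** Builds a new dictionary in first-seen order and encodes the values against it. */
    static Encoded encode(List<String> values) {
        Map<String, Integer> lookup = new LinkedHashMap<>();
        int[] indices = new int[values.size()];
        for (int i = 0; i < values.size(); i++) {
            String value = values.get(i);
            Integer index = lookup.get(value);
            if (index == null) {
                // First time this value is seen: append it to the dictionary.
                index = lookup.size();
                lookup.put(value, index);
            }
            indices[i] = index;
        }
        return new Encoded(indices, new ArrayList<>(lookup.keySet()));
    }

    /** Replaces each index with the dictionary value it refers to. */
    static List<String> decode(Encoded encoded) {
        List<String> values = new ArrayList<>(encoded.indices.length);
        for (int index : encoded.indices) {
            values.add(encoded.dictionary.get(index));
        }
        return values;
    }

    public static void main(String[] args) {
        // Step 1: a regular "vector" of values.
        List<String> original = Arrays.asList("foo", "bar", "foo", "baz", "bar");
        // Step 2: encode; step 3: decode; step 4: compare with the original.
        Encoded encoded = encode(original);
        List<String> decoded = decode(encoded);
        if (!decoded.equals(original)) {
            throw new AssertionError("round trip changed the data");
        }
        System.out.println(Arrays.toString(encoded.indices) + " " + encoded.dictionary);
    }
}
```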

@elahrvivaz
Contributor Author

Ok, that makes sense. I assumed the dictionary mapping would be handled by the user, not the library.

@elahrvivaz
Contributor Author

@julienledem @wesm Do you guys have any thoughts on dictionary encoding for map/struct vectors? I added some methods for encoding simple vectors, but this seems to break down with complex vectors. Seems like one should be able to e.g. have some dictionary encoded fields in a map vector, and other regular fields. I don't see any public methods to create a map vector from existing children - would it be ok to add that, or to add some dictionary encoding code to AbstractMapVector?

Member

Prefer signed int32 for the indices

Contributor Author

will fix

Member

The dictionary id is what goes in the IPC metadata, but how will this work in a strictly in-memory setting? In normal use, when you create a DictionaryVector field, the id will be null here. Then in the file/stream adapter code, you will need to either use the dictionary id here or come up with a new one.

At least in C++, I have planned for the dictionary id to be an encapsulated detail of the stream/file formats, but is otherwise not part of the public API. So I guess what I am asking is whether the dictionary id might have any use in Java outside of files and streams.

Member

I may also be a bit confused -- if these Field objects are only used for record batch disassembly and reassembly, then these changes are fine

Contributor Author

I'm not sure that the dictionary id has any use outside the stream/file formats. I was hesitant to mess with the stream/file encoding, so it seemed easier to add it here. I'm not sure of the repercussions either way...

Member

With fresh eyes, I think it's OK to leave this here for now. We'll have to resolve some implementation questions in the course of handling dictionaries in the file/stream code

@wesm
Member

wesm commented Feb 3, 2017

@elahrvivaz It would be nice to be able to hash and dictionary encode complex types, but that is a good sized project that we should tackle in a separate patch

@julienledem
Member

@elahrvivaz agreed. Simple types only is a great start

@elahrvivaz elahrvivaz changed the title WIP: ARROW-366 Java Dictionary Vector ARROW-366 Java Dictionary Vector Feb 6, 2017
@elahrvivaz
Contributor Author

@wesm @julienledem I've added the dictionary encode/decode methods, and unit tests for varchars (seems like the primary use case). Is there anything else that needs to be addressed here? I'd like to work on file encoding next, but will start a new PR for that.
Thanks,

@wesm
Member

wesm commented Feb 7, 2017

I'm sorry about the broken builds. If you rebase, you should get a passing Travis CI build.

I will review this again today — I will do my best to keep pace with you on implementing the Stream/File integration for dictionaries so we can have working integration tests at the end.

@julienledem can you also take a look so we can get this merged soon? Thanks!

@elahrvivaz
Contributor Author

@wesm thanks a lot! you've been very helpful and responsive.

@wesm
Member

wesm commented Feb 7, 2017

@elahrvivaz I recommend avoiding the git merge command if at all possible -- in our PR merge tool we squash away the merge commits, but they can cause problems in other projects. In the future, it will be better to use git rebase (possibly with git rebase -i first if you need to combine commits) when updating patches with the latest from master.

* Static encode methods to convert a vector into a dictionary encoded vector
* Encoding can use a provided dictionary or create a new one
* Not implemented for complex vectors or byte array vectors
* Added dictionary ID attribute to Field
* Updated Text equals and hashcode to allow dictionary lookups
@elahrvivaz
Contributor Author

thanks, rebased and squashed to a single commit

Member

@wesm wesm left a comment

This gets a +1 from me with the id removed from the Dictionary class and the encode method. I think the other design questions will be harder to resolve until proceeding to the file/stream implementation stage

* @param dictionaryId the id to use for the newly created dictionary
* @return dictionary encoded vector
*/
public static DictionaryVector encode(ValueVector vector, long dictionaryId) {
Member

I would prefer to nix the dictionaryId here and in the Dictionary class, and instead make that an implementation detail that is handled in the stream/file code. The reason is that data may come from many different places, and dictionary ids may conflict if we do not know about all of them a priori. Whereas in the messaging / IPC code, we have the opportunity to assign unambiguous ids to the dictionaries
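
As a rough sketch of that idea (an assumption about how it might be done, not the actual stream/file code), the writer could keep its own registry and hand out an id the first time it sees a given dictionary instance. `DictionaryIdAssignerSketch` and `idFor` are hypothetical names, and a `List` stands in for the dictionary vector.

```java
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical writer-side registry: ids are handed out only when a dictionary
// is first written, so dictionaries from unrelated sources cannot collide.
final class DictionaryIdAssignerSketch {
    // Keyed by object identity, so two distinct dictionaries with equal
    // contents still receive separate ids.
    private final Map<Object, Long> idsByDictionary = new IdentityHashMap<>();
    private long nextId = 0;

    /** Returns the id for this dictionary instance, assigning a fresh one on first use. */
    long idFor(List<?> dictionary) {
        Long id = idsByDictionary.get(dictionary);
        if (id == null) {
            id = nextId++;
            idsByDictionary.put(dictionary, id);
        }
        return id;
    }
}
```

Because the ids exist only inside the writer, in-memory code never has to coordinate them across data sources.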

Contributor Author

ok, we'll just have to update the Field id during encoding then. I'll take it out here and leave that for the encoding work.

Contributor Author

removed in both places

Member

@wesm wesm left a comment

+1. thank you for the patient labor! Your PR title is missing a : after ARROW-366, let me see if I can appease the PR tool and merge this now

@asfgit asfgit closed this in c322cbf Feb 7, 2017
@elahrvivaz
Contributor Author

great, thanks!

@elahrvivaz elahrvivaz deleted the ARROW-366 branch February 10, 2017 16:14
wesm added a commit to wesm/arrow that referenced this pull request Sep 8, 2018
This supersedes apache#309 and incorporates the `std::shared_ptr<const KeyValueMetadata>` pattern so less copying is needed in Parquet for metadata inbound from Arrow (and vice versa).

close apache#309

Author: Wes McKinney
Author: Phillip Cloud

Closes apache#314 from wesm/PARQUET-595 and squashes the following commits:

c0199c5 [Wes McKinney] Remove some more std::string includes
3d3be4e [Wes McKinney] Remove string include
b2ed09e [Wes McKinney] Add backwards compatible schema APIs
116575a [Wes McKinney] Use std::shared_ptr<const KeyValueMetadata> from upstream Arrow
5116eaa [Phillip Cloud] Add support for reading/writing Schema-level Arrow metadata

Change-Id: I80d6443efcd89c52b09a357e7b1d9eeabdff79b8
pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025
I've added a dictionary type, and a partial implementation of a dictionary vector that just wraps an index vector and has a reference to a lookup vector. The spec seems to indicate that any array can be dictionary encoded, but the C++ implementation created a new type, so I went that way.
Feedback would be appreciated - I want to make sure I'm on the right path.

Author: Emilio Lahr-Vivaz

Closes apache#309 from elahrvivaz/ARROW-366 and squashes the following commits:

60836ea [Emilio Lahr-Vivaz] removing dictionary ID from encoded vector
0871e13 [Emilio Lahr-Vivaz] ARROW-366 Adding Java dictionary vector