ARROW-3566: [Format] Clarify the type of dictionary encoded field #2798
cc @wesm @elahrvivaz
@wesm I assume the C++ implementation encodes dictionary field types the same way, otherwise we wouldn't be passing the dictionary vector integration tests.
format/Schema.fbs
Outdated
// Name is not required, in i.e. a List
name: string;
nullable: bool;
// This is the type of the index if the field is dictionary encoded
I don't think this is right. In Java, when a dictionary Field is written, it first calls DictionaryUtility.toMessageFormat, which changes the Field to be the dictionary type.
Ah, thanks, that was the bridge to fix the gap between the Java API and the IPC format. If I remember correctly, people didn't want to embed dictionary-related logic throughout the codebase, which would be required if the internal Java field type were the dictionary type.
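To make the conversion described above concrete, here is a minimal sketch in Python. It is not the Java implementation: `Field`, `DictionaryEncoding`, and `to_message_format` are hypothetical stand-ins, types are simplified to strings, and the assumption (taken from this thread) is that the in-memory field carries the index type while the written field carries the dictionary's value type.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional


@dataclass(frozen=True)
class DictionaryEncoding:
    # Hypothetical stand-in for the dictionary metadata attached to a field.
    id: int
    index_type: str  # simplified: type names as strings, e.g. "int8"
    is_ordered: bool = False


@dataclass(frozen=True)
class Field:
    # Hypothetical stand-in for a schema field.
    name: str
    type: str
    nullable: bool = True
    dictionary: Optional[DictionaryEncoding] = None


def to_message_format(field: Field, value_types: Dict[int, str]) -> Field:
    """Return the field as it would be written in the schema message.

    Assumption: a dictionary-encoded in-memory field holds the index type,
    and the written field swaps that for the dictionary's value type while
    keeping the index type in the dictionary metadata.
    """
    if field.dictionary is None:
        return field  # plain fields are written unchanged
    return replace(field, type=value_types[field.dictionary.id])


# In memory the field looks like an int8 column with dictionary id 0.
in_memory = Field("dict1_0", "int8", True, DictionaryEncoding(0, "int8"))
# Dictionary 0 holds utf8 values, so the written field reports utf8.
on_wire = to_message_format(in_memory, {0: "utf8"})
print(on_wire.type, on_wire.dictionary.index_type)
```

Keeping the swap in one helper reflects the point made above: the rest of the codebase can treat the field as an ordinary index-typed field, and only the write path needs dictionary-specific logic.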
Ok, I have checked the integration tests and it seems @BryanCutler is correct: the type here is the decoded type, not the index type.
This is taken from the generated JSON file in the integration tests:
{
  "name": "dict1_0",
  "type": {
    "name": "utf8"
  },
  "nullable": true,
  "children": [],
  "dictionary": {
    "id": 0,
    "indexType": {
      "name": "int",
      "isSigned": true,
      "bitWidth": 8
    },
    "isOrdered": false
  }
},
...
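As a quick check of how the excerpt above reads, the following Python sketch pulls the two types out of that field. The key names come from the excerpt itself; `describe_field` is a hypothetical helper written for illustration and is not part of the Arrow integration tooling.

```python
import json

# The field from the integration-test excerpt above, as a standalone document.
FIELD_JSON = """
{
  "name": "dict1_0",
  "type": {"name": "utf8"},
  "nullable": true,
  "children": [],
  "dictionary": {
    "id": 0,
    "indexType": {"name": "int", "isSigned": true, "bitWidth": 8},
    "isOrdered": false
  }
}
"""


def describe_field(field):
    """Return (value_type, index_type) for a field in the JSON layout above.

    index_type is None when the field has no "dictionary" entry.
    """
    value_type = field["type"]["name"]
    dictionary = field.get("dictionary")
    if dictionary is None:
        return value_type, None
    index = dictionary["indexType"]
    prefix = "int" if index["isSigned"] else "uint"
    return value_type, f"{prefix}{index['bitWidth']}"


field = json.loads(FIELD_JSON)
print(describe_field(field))
```

The top-level `type` gives the decoded value type (`utf8`), while the index type only appears nested under `dictionary.indexType`.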
@wesm Could you please help confirm what the proper type should be here?
Ok, I have checked the integration tests and it seems the type here is the decoded type, not the index type. This is taken from the generated JSON file in the integration tests: @wesm Would you please help clarify what the correct type is here?
In C++ we have a synthetic
wesm
left a comment
+1
Per discussion on #2681 (comment)