ARROW-3566: [Format] Clarify the type of dictonary encoded field #2798

icexelloss · 2018-10-19T15:11:32Z

icexelloss · 2018-10-19T15:13:02Z

elahrvivaz · 2018-10-22T12:50:17Z

@wesm I assume the C++ implementation encodes dictionary field types the same way; otherwise we wouldn't be passing the dictionary vector integration tests.

BryanCutler · 2018-10-22T20:48:10Z

format/Schema.fbs

  // Name is not required, in i.e. a List
  name: string;
  nullable: bool;
+  // This is the type of the index if the field is dictionary encoded


I don't think this is right. In Java, when a dictionary Field is written, it first calls DictionaryUtility.toMessageFormat, which changes the Field to be the dictionary type.

Ah, thanks, that was the bridge to fix the gap between the Java API and the IPC format. If I remember correctly, people didn't want to embed dictionary-related logic throughout the codebase, which would be required if the internal Java field type were the dictionary type.

OK, I have checked the integration tests and it seems @BryanCutler is correct: the type here is the decoded type, not the index type.

This is taken from the generated JSON file in the integration tests:

{ "name": "dict1_0", "type": { "name": "utf8" }, "nullable": true, "children": [], "dictionary": { "id": 0, "indexType": { "name": "int", "isSigned": true, "bitWidth": 8 }, "isOrdered": false } }, ...

@wesm Could you please help confirm what the proper type should be here?

icexelloss · 2018-10-24T16:44:27Z

OK, I have checked the integration tests and it seems the type here is the decoded type, not the index type.

This is taken from the generated JSON file in the integration tests:

      {
        "name": "dict1_0",
        "type": {
          "name": "utf8"
        },
        "nullable": true,
        "children": [],
        "dictionary": {
          "id": 0,
          "indexType": {
            "name": "int",
            "isSigned": true,
            "bitWidth": 8
          },
          "isOrdered": false
        }
      },
      ...

@wesm Would you please help clarify what the correct type is here?

wesm · 2018-10-24T19:43:42Z

In C++ we have a synthetic DictionaryType that holds the dictionary / dictionary type as well as the index type. In the IPC metadata, the field type is that of the decoded type. The DictionaryEncoding table holds the index type.
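
To make the split concrete, here is a minimal sketch that reads the integration-test JSON field quoted above and separates the decoded (value) type from the index type. It uses only the Python standard library; `describe_field` and its output keys are hypothetical names chosen for illustration and are not part of any Arrow API.

```python
import json

# Field entry copied from the integration-test JSON quoted in this thread.
FIELD_JSON = """
{
  "name": "dict1_0",
  "type": {"name": "utf8"},
  "nullable": true,
  "children": [],
  "dictionary": {
    "id": 0,
    "indexType": {"name": "int", "isSigned": true, "bitWidth": 8},
    "isOrdered": false
  }
}
"""


def describe_field(field):
    """Hypothetical helper: report a field's decoded type and, if present, its index type."""
    # The top-level "type" is the decoded (value) type.
    # The index type lives only inside the optional "dictionary" entry.
    dictionary = field.get("dictionary")
    return {
        "name": field["name"],
        "value_type": field["type"]["name"],
        "index_type": dictionary["indexType"] if dictionary else None,
        "is_dictionary_encoded": dictionary is not None,
    }


info = describe_field(json.loads(FIELD_JSON))
print(info["value_type"])   # decoded type of the field
print(info["index_type"])   # index type taken from the dictionary entry
```

For the field above, the decoded type comes out as `utf8`, and the signed 8-bit integer index type appears only under `dictionary`, which matches the description of the IPC metadata given here.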

icexelloss · 2018-10-24T20:20:15Z

@wesm Thanks for the clarification. Then I think this PR should be good to go.

I will need to go back to #2681 to see what to do there. Thanks, all!

wesm

+1

Clarify the type of dictonary encoded field

594c5a4

icexelloss changed the title ~~Clarify the type of dictonary encoded field~~ ARROW-3566: Clarify the type of dictonary encoded field Oct 19, 2018

BryanCutler reviewed Oct 22, 2018

type should be decoded type

4c6ebb1

icexelloss mentioned this pull request Oct 26, 2018

ARROW-3396: [WIP] [Java] VectorSchemaRoot.create(schema, allocator) does not create the expected vector type for dictionary-encoded fields #2681

Closed

wesm changed the title ~~ARROW-3566: Clarify the type of dictonary encoded field~~ ARROW-3566: [Format] Clarify the type of dictonary encoded field Oct 30, 2018

wesm approved these changes Oct 30, 2018

wesm closed this in f2bf068 Oct 30, 2018

asfimport mentioned this pull request Jun 3, 2019

Clarify that the type of dictionary encoded field should be the encoded(index) type #19880

Closed
