Skip to content

Conversation

@vibhatha
Copy link
Collaborator

@vibhatha vibhatha commented Jun 24, 2024

Rationale for this change

The MapVector keeps the KEY_NAME and VALUE_NAME as a constant values but rather it should be extracted from the provided fields.

What changes are included in this PR?

Updating KEY_NAME and VALUE_NAME from the provided fields. Adding a test case to validate that.

  • Add IPC Test Cases
  • Add C Data Test Cases

Are these changes tested?

Yes. A test case has been added to validate this.

Are there any user-facing changes?

No

@vibhatha vibhatha marked this pull request as ready for review June 24, 2024 01:15
@vibhatha vibhatha requested a review from lidavidm as a code owner June 24, 2024 01:15
Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not add mutable global state

@lidavidm
Copy link
Member

Also, we should test this via IPC and C Data

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Jun 25, 2024
@vibhatha
Copy link
Collaborator Author

Let's not add mutable global state

Would it be okay if we use getKey() across the static usages? That is going to be a larger diff.

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jun 25, 2024
@vibhatha vibhatha requested a review from lidavidm June 25, 2024 11:03
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should only be if there are 0 children? If there's any other number of children presumably it should fail

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, so better to check size==0 and assign the KEY_NAME else, the field from the struct. Also for value, I think the most accurate check would be size < 2, because there could be no key defined, then there could be 1 child (when key was added).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to see IPC and C Data integration tests with other languages, not within Java

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lidavidm this was an old issue filed for this purpose and we have a C Data interface test being skipped because of that. https://github.com/apache/arrow/issues/24869

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have enabled the non_canonical_map tests in archery integration tests. But I am getting one failure as shown below.

RuntimeError: Command failed: java -Dio.netty.tryReflectionSetAccessible=true -Darrow.struct.conflict.policy=CONFLICT_APPEND -XX:-UsePerfData --add-opens=java.base/java.nio=org.apache.arrow.memory.core,ALL-UNNAMED --add-reads=org.apache.arrow.flight.core=ALL-UNNAMED -cp /home/vibhatha/github/fork/arrow/java/tools/target/arrow-tools-17.0.0-SNAPSHOT-jar-with-dependencies.jar org.apache.arrow.tools.Integration -a /tmp/tmpj67jyv0g/8bbfe7eb_generated_map_non_canonical.consumer_stream_as_file -j /tmp/arrow-integration-7d6i1uz7/generated_map_non_canonical.json -c VALIDATE
With output:
--------------
WARNING: Unknown module: org.apache.arrow.flight.core specified to --add-reads
WARNING: Unknown module: org.apache.arrow.memory.core specified to --add-opens
SLF4J(W): No SLF4J providers were found.
SLF4J(W): Defaulting to no-operation (NOP) logger implementation
SLF4J(W): See https://www.slf4j.org/codes.html#noProviders for further details.
Incompatible files
Different schemas:
Schema<map_other_names: Map(false)<entries: Struct<key: Utf8 not null, value: Int(32, true)> not null>>
Schema<map_other_names: Map(false)<some_entries: Struct<some_key: Utf8 not null, some_value: Int(32, true)> not null>>

I checked two things,

  1. The schema of the Arrow file is Schema<map_other_names: Map(false)<entries: Struct<key: Utf8 not null, value: Int(32, true)> not null>>

Verified by just reading the schema of the loaded file.

  1. I independently checked whether Java JsonReader/Writer can write custom key and value fields.

It can write and read custom fields, I think that's why the Java Producing and Java consuming passes.
Also Java Producing and C++ consuming tests are passing.

But in C++ producing and Java consuming, I get the above error. Also in 1 I checked the schema, and it is what is being noted
Schema<map_other_names: Map(false)<entries: Struct<key: Utf8 not null, value: Int(32, true)> not null>>.

I have a doubt as it seems C++ is not producing the custom schema? Am I missing something here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

        # Canonical map names are restored on import, so the schemas are unequal
        .skip_format(SKIP_C_SCHEMA, 'C++')
        .skip_format(SKIP_C_SCHEMA, 'Java'),

there's a comment presumably referencing this. Maybe @pitrou can clarify?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the Java line, though still fails. @pitrou appreciate your feedback?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C++ only compares map field names if the check_metadata option is enabled (it's disabled by default). Java should probably do something similar.

if (check_metadata_ && (left.item_field()->name() != right.item_field()->name() ||
left.key_field()->name() != right.key_field()->name() ||
left.value_field()->name() != right.value_field()->name())) {
result_ = false;
return Status::OK();
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @pitrou for investigating this. I will take a look.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking into this, Java uses, the Integration.java program to do this validation as used by the archery. And in the program there is a schema comparison, just by using the content in the generated JSON files and arrow files. So this anyways should fail right? Because the JSON file generated has one schema, and the IPC file has another schema as the loaded data doesn't contain the canonical names in C++?

JSON Schema

  "schema": {
    "fields": [
      {
        "name": "map_other_names",
        "type": {
          "name": "map",
          "keysSorted": false
        },
        "nullable": true,
        "children": [
          {
            "name": "some_entries",
            "type": {
              "name": "struct"
            },
            "nullable": false,
            "children": [
              {
                "name": "some_key",
                "type": {
                  "name": "utf8"
                },
                "nullable": false,
                "children": []
              },
              {
                "name": "some_value",
                "type": {
                  "name": "int",
                  "isSigned": true,
                  "bitWidth": 32
                },
                "nullable": true,
                "children": []
              }
            ]
          }
        ]
      }
    ]
  },

IPC Content

>>> import pyarrow as pa
>>> file_path = "/tmp/tmppa0h4f3m/7a269ab8_generated_map_non_canonical.consumer_stream_as_file"
>>> with pa.OSFile(file_path, 'rb') as source:
...    loaded_array = pa.ipc.open_file(source).read_all()
... 
>>> loaded_array
pyarrow.Table
map_other_names: map<string, int32>
  child 0, entries: struct<key: string not null, value: int32> not null
      child 0, key: string not null
      child 1, value: int32
----
map_other_names: [[keys:["mw3gônj"]values:[null],null,keys:["矢gc6h43","4n矢3°5€"]values:[2147483647,-99106826],keys:["or3iµg£","naadofp","°dfhjrl","hôr£µn£"]values:[1403778175,401427101,null,1171070133],keys:["i2nk4oô"]values:[1754612069],keys:["wiôgid6","dÂom23r"]values:[1528772736,-696511786],keys:["rdÂ3n1r","wµrôÂfc","kfakmko","°2Â3hÂ"]values:[null,1836582554,-1502317905,153924375]]]

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @pitrou

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vibhatha I'll repeat what I said above: C++ only compares map field names if the check_metadata option is enabled (it's disabled by default). Java should probably do something similar.

Note that Integration.java calls Validator.compareSchemas...

@github-actions github-actions bot added awaiting changes Awaiting changes awaiting review Awaiting review awaiting committer review Awaiting committer review and removed awaiting change review Awaiting change review awaiting review Awaiting review awaiting changes Awaiting changes awaiting committer review Awaiting committer review labels Jun 27, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Jun 28, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jul 23, 2024
@vibhatha vibhatha requested a review from pitrou August 8, 2024 02:12
@github-actions
Copy link

⚠️ GitHub issue apache/arrow-java#71 has no components, please add labels for components.

@kou
Copy link
Member

kou commented Nov 27, 2024

Could you open a new PR in apache/arrow-java?

@kou kou closed this Nov 27, 2024
@vibhatha
Copy link
Collaborator Author

Sure I will look into it.

@github-actions
Copy link

⚠️ GitHub issue #42218 has no components, please add labels for components.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants