Skip to content

Conversation

@mbostock
Copy link
Contributor

@mbostock mbostock commented Mar 7, 2024

Rationale for this change

The ProxyHandler.ownKeys implementation must return strings or symbols. Because of this bug, it was returning numbers, causing the in operator to crash when trying to iterate over the keys of a MapRow object.

An example of this is a DuckDB SQL query using the HISTOGRAM operator:

SELECT HISTOGRAM(phot_g_mean_mag) FROM gaia

What changes are included in this PR?

Instead of calling array.map(String), which returns a typed array of non-strings when array is a typed array, call Array.from which is guaranteed to return strings.

Are these changes tested?

Apologies, but I don’t know how to test this.

Are there any user-facing changes?

This fixes a crash when using the in operator on a MapRow object.

@github-actions
Copy link

github-actions bot commented Mar 7, 2024

⚠️ GitHub issue #40407 has been automatically assigned in GitHub to PR creator.

return Array.from(row[kKeys].toArray(), String);
}
has(row: MapRow<K, V>, key: string | symbol) {
return row[kKeys].includes(key);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I’m not fixing this in this PR, but I suspect that there’s also a bug here: if key is a string, and rows[kKeys] is a typed array, then includes(key) will always return false. You’d need to coerce the key to a number, or coerce rows[kKeys] to strings, for this test to return true. Same with other usage of row[kKeys].indexOf(key) below, I expect.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mbostock any reason not to fix it in this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m not familiar enough with this codebase to do it confidently and to test. I’d welcome someone else doing it.

@trxcllnt
Copy link
Contributor

trxcllnt commented Mar 7, 2024

LGTM, thanks @mbostock!

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Mar 7, 2024
@mbostock
Copy link
Contributor Author

mbostock commented Mar 7, 2024

Woot, thanks for the approval @trxcllnt. I’m excited to contribute. 😁

@trxcllnt
Copy link
Contributor

trxcllnt commented Mar 7, 2024

IIRC the Map keys are required to be strings (tho I'm not sure we prevent someone from using a different type). Or has newer Arrow has loosened this requirement? That's the only reason I can think we'd have made the assumption the keys would always be strings here.

@trxcllnt
Copy link
Contributor

trxcllnt commented Mar 7, 2024

I personally prefer supporting maps with keys of any Arrow dtype, but not sure how compatible with other implementations that will be. I do see the integration tests seem to only test maps with string keys.

@domoritz
Copy link
Member

domoritz commented Mar 7, 2024

Python (which uses C++ which could be considered the canonical implementation) only allows string keys.

Traceback (most recent call last):
  File "/Users/dominik/Code/ramsch/map.py", line 8, in <module>
    table = pa.table({'a': range(10), 'b': np.random.randn(10), 'c': [{x: x} for x in range(10)]})
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/table.pxi", line 5204, in pyarrow.lib.table
  File "pyarrow/table.pxi", line 1813, in pyarrow.lib._Tabular.from_pydict
  File "pyarrow/table.pxi", line 5339, in pyarrow.lib._from_pydict
  File "pyarrow/array.pxi", line 374, in pyarrow.lib.asarray
  File "pyarrow/array.pxi", line 344, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 42, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Expected dict key of type str or bytes, got 'int'

Rust doesn't seem to enforce string keys and we couldn't find anything in the spec so it seems to be a gray area right now.

@trxcllnt
Copy link
Contributor

trxcllnt commented Mar 7, 2024

Yeah, I'm inclined to tell DuckDB their Map implementation is non-conformant. Generally whatever libarrow/pyarrow does (and what's in the integration tests) is the source of truth.

Even if JS allows non-string keys, other implementations probably won't. This has been a significant source of confusion in the past, so we try to align with C++ as much as possible.

Copy link
Contributor

@trxcllnt trxcllnt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing to "request changes" until we can get some clarification about non-string Map keys. Maybe @pitrou or @kou have thoughts about this?

@CurtHagenlocher
Copy link
Contributor

It is absolutely not the case that maps must have string keys. As just one example, part of the definition of ADBC metadata includes a map<int32, list<int32>>.

@trxcllnt
Copy link
Contributor

trxcllnt commented Mar 7, 2024

You're right, I don't see C++ checking the keys type in MapArray::ValidateChildData(). Maybe the string keys is just a thing pyarrow enforces. Either way, it'd be good to get some integration tests to settle any disagreements.

@pitrou
Copy link
Member

pitrou commented Mar 7, 2024

This is not how you would create a Map array with PyArrow. Example:

>>> ty = pa.map_(pa.int8(), pa.int32())
>>> a = pa.array([{1: 1000, 2: 10000}, {3: -1000}], type=ty)
>>> a
<pyarrow.lib.MapArray object at 0x7f7ab948dd80>
[
  keys:
  [
    1,
    2
  ]
  values:
  [
    1000,
    10000
  ],
  keys:
  [
    3
  ]
  values:
  [
    -1000
  ]
]

@domoritz
Copy link
Member

domoritz commented Mar 8, 2024

Oh, pa.table doesn't recognize the dictionary. Thanks for the corrected code!

@mbostock
Copy link
Contributor Author

Anything else I can do to help this land? Seems like we still want this fix.

@mbostock
Copy link
Contributor Author

mbostock commented Apr 2, 2024

Apologies for bumping, but I’d love to help move this forward as this is currently preventing us from using DuckDB’s HISTOGRAM function (e.g. SELECT HISTOGRAM(phot_g_mean_mag) crashes per observablehq/framework#1014). Please let me know if there is anything I can do.

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting committer review Awaiting committer review labels Apr 16, 2024
@domoritz domoritz merged commit 117460b into apache:main Apr 16, 2024
@mbostock mbostock deleted the mbostock/fix-string-map branch April 17, 2024 00:21
@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 117460b.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants