[C++] extract_regex gives bizarre behavior after nulls or non-matches #28418

@asfimport

After a non-match, later strings may still match, but their captured values land in the wrong element of the child array: each value is shifted one slot later, and the last one is lost.

>>> pa.compute.extract_regex(pa.array(["a", "b", "c", "d"]), pattern="(?P<x>[^b])")
<pyarrow.lib.StructArray object at 0x7f80de918ee0>
-- is_valid:
  [
    true,
    false,
    true,
    true
  ]
-- child 0 type: string
  [
    "a",
    "",
    "",
    "c"
  ]

The same happens when matching after a null:

>>> pa.compute.extract_regex(pa.array(["a", None, "c", "d", "e"]), pattern="(?P<x>[^b])")
<pyarrow.lib.StructArray object at 0x7f80de918ee0>
-- is_valid:
  [
    true,
    false,
    true,
    true,
    true
  ]
-- child 0 type: string
  [
    "a",
    "",
    "",
    "c",
    "d"
  ]
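
For comparison, here is a minimal sketch of the per-element result one would expect, written with only the standard library's `re` module. The helper name `extract_regex_reference` is hypothetical and not part of pyarrow, and Python's `re` is not the regex engine Arrow uses, so this illustrates only the intended alignment (one result per input, null for a null or a non-match) for this simple pattern.

```python
import re
from typing import Dict, List, Optional


def extract_regex_reference(
    values: List[Optional[str]], pattern: str
) -> List[Optional[Dict[str, str]]]:
    """Hypothetical helper: one result per input, None for null or non-match."""
    compiled = re.compile(pattern)
    results: List[Optional[Dict[str, str]]] = []
    for value in values:
        # Nulls and non-matches both map to a null struct.
        match = compiled.search(value) if value is not None else None
        results.append(match.groupdict() if match is not None else None)
    return results


print(extract_regex_reference(["a", "b", "c", "d"], "(?P<x>[^b])"))
print(extract_regex_reference(["a", None, "c", "d", "e"], "(?P<x>[^b])"))
```

In both inputs, "c" and "d" keep their own positions, which is what the child array above fails to do.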

Workaround: 1) filter out nulls and non-matches; 2) run extract_regex on only the matching strings; 3) re-insert nulls at the original positions of the nulls and non-matches:

import numpy as np
import pyarrow as pa
import pyarrow.compute  # makes pa.compute available


def _extract_regex_workaround_arrow_12670(
    array: pa.StringArray, *, pattern: str
) -> pa.StructArray:
    # True where the pattern matches, False where it does not, null for null input
    ok = pa.compute.match_substring_regex(array, pattern=pattern)
    good = array.filter(ok)  # drops non-matches and nulls
    good_matches = pa.compute.extract_regex(good, pattern=pattern)

    # Build take() indices that look like [None, 0, None, 1, 2, 3, None, 4]
    # ... ok_nonnull: [False, True, False, True, True, True, False, True]
    # (not ok.fill_null(False).cast(pa.int8()) because of ARROW-12672 segfault)
    ok_nonnull = pa.compute.and_kleene(ok.is_valid(), ok)
    # ... np_ok: [0, 1, 0, 1, 1, 1, 0, 1]
    np_ok = ok_nonnull.cast(pa.int8()).to_numpy(zero_copy_only=False)
    # ... np_index: [-1, 0, 0, 1, 2, 3, 3, 4]
    np_index = np.cumsum(np_ok, dtype=np.int64) - 1
    # ... index_or_null: [None, 0, None, 1, 2, 3, None, 4]
    # (the data bitmap of ok_nonnull doubles as the validity bitmap)
    valid = ok_nonnull.buffers()[1]
    index_or_null = pa.Array.from_buffers(
        pa.int64(), len(array), [valid, pa.py_buffer(np_index)]
    )

    return good_matches.take(index_or_null)
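
The index construction in step 3 can be sketched in plain Python, without pyarrow or NumPy, to make the arithmetic easier to follow. The helper name `take_indices` is hypothetical; it mirrors the running-count-minus-one idea of the workaround using `itertools.accumulate`, and is an illustration rather than a replacement for the Arrow code.

```python
from itertools import accumulate
from typing import List, Optional


def take_indices(ok: List[Optional[bool]]) -> List[Optional[int]]:
    """Hypothetical helper: position of each row in the filtered array, or None."""
    # Treat a null mask entry like a non-match.
    ok_nonnull = [bool(flag) for flag in ok]
    # Running count of matches so far, shifted to be zero-based.
    running = [count - 1 for count in accumulate(int(flag) for flag in ok_nonnull)]
    # Keep the index only where the row actually matched.
    return [index if flag else None for index, flag in zip(running, ok_nonnull)]


print(take_indices([False, True, None, True, True, True, False, True]))
```

Each matching row gets its zero-based position among the matches, so a take over the filtered results puts every value back at its original row and leaves nulls everywhere else.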

Reporter: Adam Hooper / @adamhooper
Assignee: Antoine Pitrou / @pitrou
Watchers: Rok Mihevc / @rok

Note: This issue was originally created as ARROW-12670. Please see the migration documentation for further details.
