[C++][Compute] replace_substring_regex() creates invalid arrays => crash #28514

@asfimport

Description

Minimal reproduction:

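import pyarrow as pa
import pyarrow.compute  # ensures pa.compute is available

# 16 identical strings; the pattern "X" never matches, so the output should equal the input.
arr = pa.array(['A'] * 16)
arr2 = pa.compute.replace_substring_regex(arr, pattern="X", replacement="Y")
arr2.validate(full=True)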

Expected results: a valid array
Actual results: pyarrow.lib.ArrowInvalid: Offset invariant failure: non-monotonic offset at slot 64: 0 < 63
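For readers unfamiliar with the error above: a variable-length string array stores an offsets buffer, and each offset must be greater than or equal to the one before it. The following is a pure-Python illustration of that invariant only, not Arrow's actual validation code; the function name `first_non_monotonic_slot` is hypothetical, and its slot numbering is an assumption that may not match the numbering in Arrow's error message.

```python
from typing import Optional, Sequence


def first_non_monotonic_slot(offsets: Sequence[int]) -> Optional[int]:
    """Return the first index whose offset is smaller than the previous one.

    Returns None when the offsets are non-decreasing, i.e. the invariant holds.
    This is an illustrative check, not pyarrow's implementation.
    """
    for slot in range(1, len(offsets)):
        if offsets[slot] < offsets[slot - 1]:
            return slot
    return None


# Offsets for three one-character strings: each value starts where the last ended.
valid_offsets = [0, 1, 2, 3]
# A trailing offset that drops back to 0 breaks the invariant.
broken_offsets = [0, 1, 2, 0]

print(first_non_monotonic_slot(valid_offsets))
print(first_non_monotonic_slot(broken_offsets))
```

An array whose offsets fail this kind of check can make downstream code compute a negative or huge string length, which is consistent with the crash reported below.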

Because the resulting array is invalid, running arr.diff(arr2) crashes the process with something like:

terminate called after throwing an instance of 'std::length_error'
  what():  basic_string::_S_create
Aborted (core dumped)

This seems to happen if and only if the input array length is a multiple of 16. That leads to an ugly workaround:

def replace_substring_regex_workaround_12774(
    array: pa.Array,
    *,
    pattern: str,
    replacement: str
) -> pa.Array:
    if len(array) > 0 and len(array) % 16 == 0:
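        # Split into chunks of length 1 and len(array) - 1, neither of which
        # is a multiple of 16, then recombine the result into a single array.
        chunked_array = pa.chunked_array([array.slice(0, 1), array.slice(1)], type=array.type)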
        return pa.compute.replace_substring_regex(
            chunked_array,
            pattern=pattern,
            replacement=replacement
        ).combine_chunks()
    else:
        return pa.compute.replace_substring_regex(
            array,
            pattern=pattern,
            replacement=replacement
        )

Reporter: Adam Hooper / @adamhooper
Assignee: Niranda Perera / @nirandaperera

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-12774. Please see the migration documentation for further details.
