FindRepeated and Unicode / may break OP_STAR/PLUS/... #371

@User4martin

I have not analysed this any further.

FindRepeated (in the Unicode build) calls IncUnicode2, which may increment by 2 when it meets a surrogate pair. For the OPs that can match a surrogate pair this will be a problem.

OP_STAR/... in MatchPrim iterate over the returned range in steps of one ReChar (code unit): regInput := save + no;

Also, the result of FindRepeated is not consistent. Depending on the OP, it may be a count of

  • code units for OP_ANY (counting a surrogate pair as 2)
  • "Chars"/full code points for any of the OP_NOT... (counting a surrogate pair as 1)
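
To make the difference between the two counts concrete, here is a small model in Python. It is not TRegExpr code; the function names are made up for illustration, and the text is simply treated as a list of UTF-16 code units (the analogue of ReChars in a Unicode build).

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def is_high_surrogate(unit):
    """True for the 1st half of a surrogate pair (U+D800..U+DBFF)."""
    return 0xD800 <= unit <= 0xDBFF


def count_code_units(units):
    """Count every code unit, so a surrogate pair counts as 2."""
    return len(units)


def count_code_points(units):
    """Count full code points, so a surrogate pair counts as 1."""
    count = 0
    i = 0
    while i < len(units):
        # Step over both halves of a pair, but count it only once.
        if is_high_surrogate(units[i]) and i + 1 < len(units):
            i += 2
        else:
            i += 1
        count += 1
    return count


# "a", one character outside the BMP (U+1F600), "b"
units = utf16_units("a\U0001F600b")
print(count_code_units(units))   # 4
print(count_code_points(units))  # 3
```

If a caller adds the second kind of count to a pointer that advances by code units, it ends up one position short for every surrogate pair in the range.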

One case I can think of is the pattern (.+). (including the trailing dot):

  • if the last char in the text is a surrogate pair, then the capture ends with half of that pair
  • if the text is exactly one char, and that char is a surrogate pair, then the pattern incorrectly matches. It needs 2 chars, and takes each half of the surrogate pair as a full char.

OP_STAR goes back by half a surrogate pair, and then OP_ANY does not check whether it is matching the 2nd part of a surrogate pair.
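
The backtracking described here can be modelled in a few lines of Python. This is a simplified sketch, not the MatchPrim code: it assumes the greedy part first consumes every code unit, that backtracking moves one code unit at a time, and that the trailing dot accepts any single code unit. The function name is hypothetical.

```python
def naive_plus_then_any(units):
    """Model of (.+). over a list of UTF-16 code units.

    Returns the capture length in code units, or None if there is no match.
    """
    max_count = len(units)  # greedy .+ takes everything first
    for no in range(max_count, 0, -1):  # back off one code unit per step
        pos = no  # analogue of regInput := save + no
        if pos < len(units):  # trailing . accepts any one code unit
            return no
    return None


pair = [0xD83D, 0xDE00]  # U+1F600 as a surrogate pair

# Text is exactly one char (a surrogate pair): the model still "matches",
# with the capture holding only the 1st half of the pair.
print(naive_plus_then_any(pair))  # 1

# Text ends in a surrogate pair: the capture ends with half of that pair.
print(naive_plus_then_any([0x61, 0x62] + pair))  # 3
```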


This may be fixable (but I have not tested it):

  • OP_STAR/... in MatchPrim must check whether regInput := save + no; points to the 2nd part of a surrogate pair
  • FindRepeated must always return the number of code units (ReChars), i.e. always count a surrogate pair as 2.
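
As a rough illustration of the first point, the Python model below skips any backtrack position that would land on the 2nd part of a surrogate pair. It assumes the count is already in code units (the second point), and it is only a sketch of the idea under those assumptions, not a tested patch for MatchPrim. All names are hypothetical.

```python
def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF


def splits_pair(units, pos):
    """True if pos points at the 2nd part of a surrogate pair."""
    return (
        0 < pos < len(units)
        and is_low_surrogate(units[pos])
        and is_high_surrogate(units[pos - 1])
    )


def plus_then_any(units):
    """Model of (.+). that never stops in the middle of a surrogate pair.

    Returns the capture length in code units, or None if there is no match.
    """
    no = len(units)  # greedy .+ takes everything, counted in code units
    while no > 0:
        if splits_pair(units, no):
            no -= 1  # step back to the start of the pair instead
            continue
        if no < len(units):  # something is left for the trailing .
            return no
        no -= 1
    return None


pair = [0xD83D, 0xDE00]  # U+1F600 as a surrogate pair

print(plus_then_any(pair))                 # None: one char cannot satisfy (.+).
print(plus_then_any([0x61, 0x62] + pair))  # 2: capture is "ab", . takes the pair
```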
