Skip to content

Proposal for simplifying case-folding (remove PT_CLIST) #514

@NWilson

Description

@NWilson

One thing that bothers me is the way PCRE2 doesn't handle the ASCII range consistently.

In "normal" (UTF/UCP) mode, the characters /k/i and /s/i are handled differently to /a/i (or all the other characters), with a PT_CLIST opcode, rather than an ordinary OP_CHARI. This means that those two characters miss out on all optimisations and special opcodes like OP_STARI.

It just pains me a little that it's so non-uniform.


I'm about halfway through a PR to get rid of the PT_CLIST technique. But it occurred to me, I should check you like the idea before I spend more time on it!

Current strategy:

  • When encountering /a/i, generate OP_CHARI, and for /k/i generate OP_PROP with PT_CLIST.
  • When processing OP_CHARI, it checks for just two characters only: looks to see if the subject character equals the pattern character, and secondly whether the UCD_OTHERCASE of the pattern character equals the subject character. (Notice how we're othercasing the pattern character, not the subject one. This is what limits OP_CHARI to only matching 2 possible subject characters.)
    match = pat_ch == subj_ch || UCD_OTHERCASE(pat_ch) == subj_ch

New strategy:

  • Just make an OP_CHARI. Don't put the actual character in the opcode - put the case-folded form in the bytecode. So /A/i would generate an OP_CHARI with the case-folded form "a". You already have all the Unicode data to do this - your UCD_OTHERCASE macro is reading the Unicode case-folding data.
  • When matching the OP_CHARI, check whether the subject character equals the (pre-case-folded) pattern character. Secondly, check whether the case-folded subject character matches the (pre-case-folded) pattern character. (Notice now we're othercasing the subject character, so OP_CHARI can match all the subject characters which case-fold to the same equivalence class.)
    match = folded_pat_ch == subj_ch || folded_pat_ch == UCD_OTHERCASE(subj_ch)

This new strategy should be no slower - we're literally doing exactly the same amount of work, both at compile-time and match-time.

  • One case-fold operation on the pattern character, at compile-time, which can use an ASCII fast-path.
    • In fact... at compile-time we don't even need to look up the UCD data at all for ASCII characters, in the new approach, whereas we do in the old approach (to find out whether the ASCII character is one of the ones like k and s with more than one matching character). So the new approach is less work - we'll never ever need to access the UCD tables for ASCII patterns matching ASCII subject strings, even with (*UTF)(*UCP) enabled. That's a small but happy win.
  • Both approaches also have exactly one case-fold operation, at match-time, which can use an ASCII fast-path.

I've done enough of the PR to be fairly sure it'll work. It's certainly "correct". It simplifies the matcher, marginally, and makes the character handling more uniform, which is always a bonus.

I can't make any promises, but I don't expect performance to suffer, because it's the same number of operations, theoretically at least. We'd need to measure to be sure, of course.


So - thoughts?

  • "No, don't bother, we like it how it is."
  • "Interesting, let's see the PR"
  • "Cool, making all the ASCII characters behave the same would be a lovely relief"

Metadata

Metadata

Assignees

No one assigned

    Labels

    untidinessNot exactly a bug, but could do better

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions