mapping of similar characters (e.g. apostrophes)? #23

@maia

As I've encountered many variants of apostrophes, I wonder if pragmatic_tokenizer should normalize these (optionally) by mapping them all to a single character to ensure that they are treated as the same word. E.g.:

["where's", "where´s", "where`s", "where’s", "where's", "where‘s", "where’s", "where‘s", "where’s", "where‛s"]

I've compiled a list of what I found so far:

U+0027 APOSTROPHE 
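U+0091 PRIVATE USE ONE (C1 control; byte 0x91 is LEFT SINGLE QUOTATION MARK in Windows-1252)
U+0092 PRIVATE USE TWO (C1 control; byte 0x92 is RIGHT SINGLE QUOTATION MARK in Windows-1252)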
U+02B9 MODIFIER LETTER PRIME
U+02BB MODIFIER LETTER TURNED COMMA
U+02BC MODIFIER LETTER APOSTROPHE
U+02BD MODIFIER LETTER REVERSED COMMA
U+02BE MODIFIER LETTER RIGHT HALF RING
U+02BF MODIFIER LETTER LEFT HALF RING
U+02C8 MODIFIER LETTER VERTICAL LINE
U+0301 COMBINING ACUTE ACCENT
U+0384 GREEK TONOS
U+1FBF GREEK PSILI
U+A78B LATIN CAPITAL LETTER SALTILLO
U+A78C LATIN SMALL LETTER SALTILLO
U+2018 LEFT SINGLE QUOTATION MARK
U+2019 RIGHT SINGLE QUOTATION MARK
U+201B SINGLE HIGH-REVERSED-9 QUOTATION MARK 
U+2032 PRIME 
U+00B4 ACUTE ACCENT 
U+0060 GRAVE ACCENT 
U+FF40 FULLWIDTH GRAVE ACCENT 
U+FF07 FULLWIDTH APOSTROPHE
U+055A ARMENIAN APOSTROPHE
U+05F3 HEBREW PUNCTUATION GERESH
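
A minimal sketch of what such an optional normalization could look like in Ruby, mapping the code points listed above to U+0027. The constant `APOSTROPHE_VARIANTS` and the method `normalize_apostrophes` are hypothetical names for illustration, not part of pragmatic_tokenizer's API. The sketch assumes UTF-8 input and deliberately leaves out U+0301 COMBINING ACUTE ACCENT, since replacing a combining mark would break decomposed accented letters such as "e" + U+0301.

```ruby
# Hypothetical helper, not part of pragmatic_tokenizer.
# Code points from the list above that get mapped to U+0027 APOSTROPHE.
# U+0301 is intentionally omitted: it is a combining mark, and replacing it
# would corrupt decomposed accented letters.
APOSTROPHE_VARIANTS = [
  0x0091, 0x0092, 0x02B9, 0x02BB, 0x02BC, 0x02BD, 0x02BE, 0x02BF,
  0x02C8, 0x0384, 0x1FBF, 0xA78B, 0xA78C, 0x2018, 0x2019, 0x201B,
  0x2032, 0x00B4, 0x0060, 0xFF40, 0xFF07, 0x055A, 0x05F3
].freeze

# Build one pattern that matches any single variant character.
APOSTROPHE_PATTERN = Regexp.union(
  APOSTROPHE_VARIANTS.map { |code_point| [code_point].pack("U") }
)

# Return a copy of +text+ with every variant replaced by a plain apostrophe.
def normalize_apostrophes(text)
  text.gsub(APOSTROPHE_PATTERN, "'")
end

variants = ["where's", "where\u00B4s", "where`s",
            "where\u2019s", "where\u2018s", "where\u201Bs"]
p variants.map { |word| normalize_apostrophes(word) }.uniq
# => ["where's"]
```

Some of these characters are letters in certain orthographies (for example U+02BB in Hawaiian), so a mapping like this probably needs to stay opt-in and perhaps configurable per language.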

I've read that WebKit-based browsers treat at least U+0027, U+2018, U+2019, and U+FF07 as equivalent, and possibly more.

There might be similar issues for other characters.