I've encountered many variants of apostrophes, so I wonder whether pragmatic_tokenizer should (optionally) normalize them by mapping them all to a single character, so that spellings differing only in the apostrophe are treated as the same word. E.g.:
["where's", "where´s", "where`s", "where’s", "where's", "where‘s", "where’s", "where‘s", "where’s", "where‛s"]
I've compiled a list of what I found so far:
U+0027 APOSTROPHE
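U+0091 PRIVATE USE ONE (C1 control; byte 0x91 is LEFT SINGLE QUOTATION MARK in Windows-1252)
U+0092 PRIVATE USE TWO (C1 control; byte 0x92 is RIGHT SINGLE QUOTATION MARK in Windows-1252)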
U+02B9 MODIFIER LETTER PRIME
U+02BB MODIFIER LETTER TURNED COMMA
U+02BC MODIFIER LETTER APOSTROPHE
U+02BD MODIFIER LETTER REVERSED COMMA
U+02BE MODIFIER LETTER RIGHT HALF RING
U+02BF MODIFIER LETTER LEFT HALF RING
U+02C8 MODIFIER LETTER VERTICAL LINE
U+0301 COMBINING ACUTE ACCENT
U+0384 GREEK TONOS
U+1FBF GREEK PSILI
U+A78B LATIN CAPITAL LETTER SALTILLO
U+A78C LATIN SMALL LETTER SALTILLO
U+2018 LEFT SINGLE QUOTATION MARK
U+2019 RIGHT SINGLE QUOTATION MARK
U+201B SINGLE HIGH-REVERSED-9 QUOTATION MARK
U+2032 PRIME
U+00B4 ACUTE ACCENT
U+0060 GRAVE ACCENT
U+FF40 FULLWIDTH GRAVE ACCENT
U+FF07 FULLWIDTH APOSTROPHE
U+055A ARMENIAN APOSTROPHE
U+05F3 HEBREW PUNCTUATION GERESH
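To make the proposal concrete, here is a minimal sketch of what such a mapping could look like in plain Ruby. The constant and method names (`APOSTROPHE_CODEPOINTS`, `APOSTROPHE_PATTERN`, `normalize_apostrophes`) are hypothetical and not part of pragmatic_tokenizer's API. It also assumes that every code point in the list above should be mapped, which is probably too aggressive for some of them (for example U+0301 is a combining mark and U+0060 is often used for code quoting), hence the suggestion to make it optional.

```ruby
# Hypothetical sketch, not pragmatic_tokenizer's actual API.
# Code points taken from the list above, excluding U+0027 (the target).
APOSTROPHE_CODEPOINTS = [
  0x0091, 0x0092, 0x02B9, 0x02BB, 0x02BC, 0x02BD, 0x02BE, 0x02BF,
  0x02C8, 0x0301, 0x0384, 0x1FBF, 0xA78B, 0xA78C, 0x2018, 0x2019,
  0x201B, 0x2032, 0x00B4, 0x0060, 0xFF40, 0xFF07, 0x055A, 0x05F3
].freeze

# Build one regexp that matches any single variant character.
# Regexp.union escapes each string, so no character is treated as regexp syntax.
APOSTROPHE_PATTERN = Regexp.union(
  APOSTROPHE_CODEPOINTS.map { |cp| [cp].pack("U") }
).freeze

# Replace every variant with U+0027 APOSTROPHE.
# Assumes the input string is UTF-8 encoded.
def normalize_apostrophes(text)
  text.gsub(APOSTROPHE_PATTERN, "'")
end

variants = ["where\u00B4s", "where`s", "where\u2019s", "where\u201Bs"]
puts variants.map { |w| normalize_apostrophes(w) }.uniq.inspect
```

With the four sample spellings, all of them should collapse to the single form `where's`. Keeping the code points in a plain array would make it easy to let users pass their own subset.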
I've read that WebKit-based browsers merge at least U+0027, U+2018, U+2019, and U+FF07, and possibly more than that.
There might be similar issues for other characters.