mapping of similar characters (e.g. apostrophes)?

I've encountered many apostrophe variants, and I wonder whether pragmatic_tokenizer should (optionally) normalize them by mapping them all to a single character, so that tokens differing only in the apostrophe are treated as the same word. E.g.:

```
["where's", "where´s", "where`s", "where’s", "where's", "where‘s", "where’s", "where‘s", "where’s", "where‛s"]
```

I've compiled a list of what I found so far:

```
U+0027 APOSTROPHE 
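U+0091 PRIVATE USE ONE (C1 control; byte 0x91 is LEFT SINGLE QUOTATION MARK in Windows-1252)
U+0092 PRIVATE USE TWO (C1 control; byte 0x92 is RIGHT SINGLE QUOTATION MARK in Windows-1252)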
U+02B9 MODIFIER LETTER PRIME
U+02BB MODIFIER LETTER TURNED COMMA
U+02BC MODIFIER LETTER APOSTROPHE
U+02BD MODIFIER LETTER REVERSED COMMA
U+02BE MODIFIER LETTER RIGHT HALF RING
U+02BF MODIFIER LETTER LEFT HALF RING
U+02C8 MODIFIER LETTER VERTICAL LINE
U+0301 COMBINING ACUTE ACCENT
U+0384 GREEK TONOS
U+1FBF GREEK PSILI
U+A78B LATIN CAPITAL LETTER SALTILLO
U+A78C LATIN SMALL LETTER SALTILLO
U+2018 LEFT SINGLE QUOTATION MARK
U+2019 RIGHT SINGLE QUOTATION MARK
U+201B SINGLE HIGH-REVERSED-9 QUOTATION MARK 
U+2032 PRIME 
U+00B4 ACUTE ACCENT 
U+0060 GRAVE ACCENT 
U+FF40 FULLWIDTH GRAVE ACCENT 
U+FF07 FULLWIDTH APOSTROPHE
U+055A ARMENIAN APOSTROPHE
U+05F3 HEBREW PUNCTUATION GERESH
```

I've read that WebKit-based browsers treat at least U+0027, U+2018, U+2019, and U+FF07 as equivalent, and possibly more characters than that.

There might be similar issues for other characters.
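
To make the idea concrete, here is a minimal sketch of such a mapping in plain Ruby. `normalize_apostrophes` and the two constants are hypothetical names for illustration, not part of pragmatic_tokenizer's API. The code points come from the list above, except U+0301: that is a combining mark, and replacing it would also rewrite decomposed accented letters, so this sketch leaves it out.

```ruby
# Hypothetical helper, NOT part of pragmatic_tokenizer.
# Code points from the list in this issue, minus U+0027 (the target)
# and U+0301 (combining accent, deliberately excluded in this sketch).
APOSTROPHE_VARIANTS = [
  0x0060, 0x0091, 0x0092, 0x00B4, 0x02B9, 0x02BB, 0x02BC, 0x02BD,
  0x02BE, 0x02BF, 0x02C8, 0x0384, 0x055A, 0x05F3, 0x1FBF, 0x2018,
  0x2019, 0x201B, 0x2032, 0xA78B, 0xA78C, 0xFF07, 0xFF40
].map { |code_point| [code_point].pack("U") }.freeze

# One pattern that matches any single variant character.
APOSTROPHE_PATTERN = Regexp.union(APOSTROPHE_VARIANTS)

# Returns a copy of +text+ with every variant replaced by U+0027.
def normalize_apostrophes(text)
  text.gsub(APOSTROPHE_PATTERN, "'")
end

words = ["where\u00B4s", "where`s", "where\u2019s", "where\u2018s", "where\u201Bs"]
words.map { |word| normalize_apostrophes(word) }.uniq
# => ["where's"]
```

Whether this should run before tokenization or on each token, and whether it should be on by default, is an open question. Some of these characters are letters in certain orthographies (e.g. the saltillo, or U+02BB), so an opt-in flag seems safer.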

