Apostrophe bug for Stanza English tokenizer model

**Describe the bug**
Some tokens containing an apostrophe come out with garbled text at the word level, even though the token-level text is correct.

**To Reproduce**

```python
import stanza

nlp = stanza.Pipeline("en", processors="tokenize")
doc = nlp("It is mainly the receptionist's responsibility to take hotel guests to the evacuation.")
print(doc)  # dump the tokens and words of each sentence
```

```javascript

[
  [
    {
      "id": 1,
      "text": "It",
      "start_char": 0,
      "end_char": 2
    },
    {
      "id": 2,
      "text": "is",
      "start_char": 3,
      "end_char": 5
    },
    {
      "id": 3,
      "text": "mainly",
      "start_char": 6,
      "end_char": 12
    },
    {
      "id": 4,
      "text": "the",
      "start_char": 13,
      "end_char": 16
    },
    {
      "id": [
        5,
        6
      ],
      "text": "receptionist's",
      "start_char": 17,
      "end_char": 31
    },
    {
      "id": 5,
      "text": "receptionstst" // garbled, expected "receptionist"
    },
    {
      "id": 6,
      "text": "'s"
    },
    {
      "id": 7,
      "text": "responsibility",
      "start_char": 32,
      "end_char": 46
    },
    {
      "id": 8,
      "text": "to",
      "start_char": 47,
      "end_char": 49
    },
    {
      "id": 9,
      "text": "take",
      "start_char": 50,
      "end_char": 54
    },
    {
      "id": 10,
      "text": "hotel",
      "start_char": 55,
      "end_char": 60
    },
    {
      "id": 11,
      "text": "guests",
      "start_char": 61,
      "end_char": 67
    },
    {
      "id": 12,
      "text": "to",
      "start_char": 68,
      "end_char": 70
    },
    {
      "id": 13,
      "text": "the",
      "start_char": 71,
      "end_char": 74
    },
    {
      "id": 14,
      "text": "evacuation",
      "start_char": 75,
      "end_char": 85,
      "misc": "SpaceAfter=No"
    },
    {
      "id": 15,
      "text": ".",
      "start_char": 85,
      "end_char": 86,
      "misc": "SpaceAfter=No"
    }
  ]
]

``` 
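As a quick sanity check on the output above, the token-level `start_char`/`end_char` offsets do line up with the input string, so only the word-level text appears to be affected. A minimal sketch, with a few entries copied from the output (`offsets_match` is a hypothetical helper name, not part of Stanza):

```python
text = "It is mainly the receptionist's responsibility to take hotel guests to the evacuation."

# A few token-level entries copied from the output above.
tokens = [
    {"text": "receptionist's", "start_char": 17, "end_char": 31},
    {"text": "responsibility", "start_char": 32, "end_char": 46},
    {"text": ".", "start_char": 85, "end_char": 86},
]


def offsets_match(text, tokens):
    """Return True if every token's offsets slice out exactly its own text."""
    return all(text[t["start_char"]:t["end_char"]] == t["text"] for t in tokens)


print(offsets_match(text, tokens))
```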
I have collected a few more examples:

- `nlp("I went to Björkängshallen's square.")`
```javascript
{
      "id": [
        4,
        5
      ],
      "text": "Björkängshallen's",
      "start_char": 10,
      "end_char": 27
    },
    {
      "id": 4,
      "text": "Bj<UNK>rkk<UNK>ngsshsn" // without the apostrophe, the text comes out correctly
    },
    {
      "id": 5,
      "text": "'s"
    }
```

- `nlp("It happend at a Stockholm municipality's nursery")`
```javascript
{
      "id": [
        6,
        7
      ],
      "text": "municipality's",
      "start_char": 26,
      "end_char": 40
    },
    {
      "id": 6,
      "text": "municipaltity"  // weird text again
    },
    {
      "id": 7,
      "text": "'s"
    },
```

- `nlp("My establishment's meaning")`
```javascript
{
      "id": [
        2,
        3
      ],
      "text": "establishment's",
      "start_char": 3,
      "end_char": 18
    },
    {
      "id": 2,
      "text": "estabblismentn" // wrong word text
    },
    {
      "id": 3,
      "text": "'s"
    },
```
- `nlp("We went to the Drottningholm's festival")`
```javascript
{
      "id": [
        5,
        6
      ],
      "text": "Drottningholm's",
      "start_char": 15,
      "end_char": 30
    },
    {
      "id": 5,
      "text": "Drottniingommm"  // wrong word text
    },
    {
      "id": 6,
      "text": "'s"
    }
```
- `nlp("The ad on the newspaper's pages")`
```javascript
{
      "id": [
        5,
        6
      ],
      "text": "newspaper's",
      "start_char": 14,
      "end_char": 25
    },
    {
      "id": 5,
      "text": "newspapper" // wrong word text
    },
    {
      "id": 6,
      "text": "'s"
    }
```

**Expected behavior**
I think the `receptionist's` token should be split into `receptionist` and `'s` at the word level, not `receptionstst` and `'s`.
For the sentence `"I went to Björkängshallen's square."`, the word text is both garbled and contains `<UNK>`. I would expect this instead:
```javascript
{
      "id": 4,
      "text": "Björkängshallen" 
    },
    {
      "id": 5,
      "text": "'s"
    }
```
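The expectation above can be phrased as a simple check: for these possessive cases, the word texts of a multi-word token should concatenate back to the token's surface text. The following is a minimal sketch that works on one sentence of the dict output shown in this report; `find_mwt_mismatches` is a hypothetical helper, not a Stanza function, and the concatenation rule is an assumption that fits possessives like these rather than every kind of multi-word token.

```python
def find_mwt_mismatches(sentence):
    """Return (token_text, joined_word_text) pairs for multi-word tokens
    whose word texts do not concatenate back to the token text."""
    # Word-level entries have an integer id; multi-word tokens have an id range.
    words = {e["id"]: e["text"] for e in sentence if isinstance(e["id"], int)}
    mismatches = []
    for entry in sentence:
        ids = entry["id"]
        if isinstance(ids, (list, tuple)) and len(ids) > 1:
            joined = "".join(words.get(i, "") for i in range(ids[0], ids[-1] + 1))
            if joined != entry["text"]:
                mismatches.append((entry["text"], joined))
    return mismatches


# Entries copied from the first example in this report.
sentence = [
    {"id": [5, 6], "text": "receptionist's", "start_char": 17, "end_char": 31},
    {"id": 5, "text": "receptionstst"},
    {"id": 6, "text": "'s"},
    {"id": 7, "text": "responsibility", "start_char": 32, "end_char": 46},
]

print(find_mwt_mismatches(sentence))
```

With a live pipeline, each sentence in `nlp(text).to_dict()` could be passed to the same function, which would make it easy to scan a larger set of inputs for affected tokens.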


**Environment (please complete the following information):**
 - OS: macOS
 - Python version: 3.11
 - Stanza version: 1.8.2
 - English tokenize model: combined

**Additional context**
I saw issue #1361. Maybe this is related?

Apostrophe bug for Stanza English tokenizer model #1417
