Closed
Describe the bug
It seems that tokens containing an apostrophe get garbled text at the word level: the token-level text is correct, but the first word of the expanded token is misspelled.
To Reproduce
import stanza
nlp = stanza.Pipeline("en", processors='tokenize')
nlp("It is mainly the receptionist's responsibility to take hotel guests to the evacuation.")
[
[
{
"id": 1,
"text": "It",
"start_char": 0,
"end_char": 2
},
{
"id": 2,
"text": "is",
"start_char": 3,
"end_char": 5
},
{
"id": 3,
"text": "mainly",
"start_char": 6,
"end_char": 12
},
{
"id": 4,
"text": "the",
"start_char": 13,
"end_char": 16
},
{
"id": [
5,
6
],
"text": "receptionist's",
"start_char": 17,
"end_char": 31
},
{
"id": 5,
"text": "receptionstst" // weird text
},
{
"id": 6,
"text": "'s"
},
{
"id": 7,
"text": "responsibility",
"start_char": 32,
"end_char": 46
},
{
"id": 8,
"text": "to",
"start_char": 47,
"end_char": 49
},
{
"id": 9,
"text": "take",
"start_char": 50,
"end_char": 54
},
{
"id": 10,
"text": "hotel",
"start_char": 55,
"end_char": 60
},
{
"id": 11,
"text": "guests",
"start_char": 61,
"end_char": 67
},
{
"id": 12,
"text": "to",
"start_char": 68,
"end_char": 70
},
{
"id": 13,
"text": "the",
"start_char": 71,
"end_char": 74
},
{
"id": 14,
"text": "evacuation",
"start_char": 75,
"end_char": 85,
"misc": "SpaceAfter=No"
},
{
"id": 15,
"text": ".",
"start_char": 85,
"end_char": 86,
"misc": "SpaceAfter=No"
}
]
]
I have collected a few more examples.
nlp("I went to Björkängshallen's square.")
{
"id": [
4,
5
],
"text": "Björkängshallen's",
"start_char": 10,
"end_char": 27
},
{
"id": 4,
"text": "Bj<UNK>rkk<UNK>ngsshsn" // if no apostrophe, the token level seems to find correct text
},
{
"id": 5,
"text": "'s"
}
nlp("It happend at a Stockholm municipality's nursery")
{
"id": [
6,
7
],
"text": "municipality's",
"start_char": 26,
"end_char": 40
},
{
"id": 6,
"text": "municipaltity" // weird text again
},
{
"id": 7,
"text": "'s"
},
nlp("My establishment's meaning")
{
"id": [
2,
3
],
"text": "establishment's",
"start_char": 3,
"end_char": 18
},
{
"id": 2,
"text": "estabblismentn" // wrong word text
},
{
"id": 3,
"text": "'s"
},
nlp("We went to the Drottningholm's festival")
{
"id": [
5,
6
],
"text": "Drottningholm's",
"start_char": 15,
"end_char": 30
},
{
"id": 5,
"text": "Drottniingommm" // wrong word text
},
{
"id": 6,
"text": "'s"
}
nlp("The ad on the newspaper's pages")
{
"id": [
5,
6
],
"text": "newspaper's",
"start_char": 14,
"end_char": 25
},
{
"id": 5,
"text": "newspapper" // wrong word text
},
{
"id": 6,
"text": "'s"
}
Expected behavior
I think the token receptionist's should be split into receptionist and 's at the word level, not receptionstst and 's.
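To make the expectation checkable, here is a minimal sketch that works on the dictionary output shown above. It assumes that for English possessives the words of a multi-word token should concatenate back to the token's surface text; that is my assumption, not a guarantee of the model, and it would not hold for contractions that are expanded to different spellings. The helper name find_mwt_mismatches is hypothetical and not part of Stanza.

```python
def find_mwt_mismatches(sentence):
    """Return (token_text, joined_word_text) pairs for multi-word tokens
    whose words do not concatenate back to the token's surface text.

    `sentence` is a list of dicts in the format printed in this report.
    """
    # Word-level entries have a plain integer id.
    words = {e["id"]: e["text"] for e in sentence if isinstance(e["id"], int)}
    mismatches = []
    for entry in sentence:
        ids = entry["id"]
        # Multi-word tokens carry a list (or tuple) of the word ids they span.
        if isinstance(ids, (list, tuple)) and len(ids) > 1:
            joined = "".join(words[i] for i in ids if i in words)
            if joined != entry["text"]:
                mismatches.append((entry["text"], joined))
    return mismatches


# Excerpt copied from the output in this report.
sentence = [
    {"id": [5, 6], "text": "receptionist's", "start_char": 17, "end_char": 31},
    {"id": 5, "text": "receptionstst"},
    {"id": 6, "text": "'s"},
    {"id": 7, "text": "responsibility", "start_char": 32, "end_char": 46},
]
print(find_mwt_mismatches(sentence))
# [("receptionist's", "receptionstst's")]
```

With the expected output (receptionist and 's) the function returns an empty list.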
As for the "I went to Björkängshallen's square." sentence, the word text is both misspelled and contains <UNK>. I would expect this:
{
"id": 4,
"text": "Björkängshallen"
},
{
"id": 5,
"text": "'s"
}
Environment (please complete the following information):
- OS: macOS
- Python version: 3.11
- Stanza version: 1.8.2
- English tokenize model: combined
Additional context
I saw issue #1361. Maybe it is related to this?