Apostrophe bug for Stanza English tokenizer model #1417

@sujoung

Describe the bug
Some tokens containing an apostrophe come back with garbled word-level text: the token's own text is correct, but the text of the words it is split into does not match it.

To Reproduce

import stanza
nlp = stanza.Pipeline("en", processors="tokenize")

# Printing the document shows the token/word dump below
doc = nlp("It is mainly the receptionist's responsibility to take hotel guests to the evacuation.")
print(doc)

[
  [
    {
      "id": 1,
      "text": "It",
      "start_char": 0,
      "end_char": 2
    },
    {
      "id": 2,
      "text": "is",
      "start_char": 3,
      "end_char": 5
    },
    {
      "id": 3,
      "text": "mainly",
      "start_char": 6,
      "end_char": 12
    },
    {
      "id": 4,
      "text": "the",
      "start_char": 13,
      "end_char": 16
    },
    {
      "id": [
        5,
        6
      ],
      "text": "receptionist's",
      "start_char": 17,
      "end_char": 31
    },
    {
      "id": 5,
      "text": "receptionstst" // weird text
    },
    {
      "id": 6,
      "text": "'s"
    },
    {
      "id": 7,
      "text": "responsibility",
      "start_char": 32,
      "end_char": 46
    },
    {
      "id": 8,
      "text": "to",
      "start_char": 47,
      "end_char": 49
    },
    {
      "id": 9,
      "text": "take",
      "start_char": 50,
      "end_char": 54
    },
    {
      "id": 10,
      "text": "hotel",
      "start_char": 55,
      "end_char": 60
    },
    {
      "id": 11,
      "text": "guests",
      "start_char": 61,
      "end_char": 67
    },
    {
      "id": 12,
      "text": "to",
      "start_char": 68,
      "end_char": 70
    },
    {
      "id": 13,
      "text": "the",
      "start_char": 71,
      "end_char": 74
    },
    {
      "id": 14,
      "text": "evacuation",
      "start_char": 75,
      "end_char": 85,
      "misc": "SpaceAfter=No"
    },
    {
      "id": 15,
      "text": ".",
      "start_char": 85,
      "end_char": 86,
      "misc": "SpaceAfter=No"
    }
  ]
]
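A quick way to spot this kind of mismatch in output like the above is to compare each multi-word token with the concatenation of its words. The following is only a sketch: `find_mwt_mismatches` is a hypothetical helper, not part of Stanza, and it assumes the words of a token should join back to the token text, which holds for the possessive 's cases here but not for multi-word tokens in general. The input is copied from the dump above, trimmed to the relevant entries.

```python
# Entries copied from the output above (only the ones needed here).
sentence = [
    {"id": [5, 6], "text": "receptionist's", "start_char": 17, "end_char": 31},
    {"id": 5, "text": "receptionstst"},
    {"id": 6, "text": "'s"},
    {"id": 7, "text": "responsibility", "start_char": 32, "end_char": 46},
]


def find_mwt_mismatches(sentence):
    """Return (token_text, joined_word_text) for each multi-word token
    whose words do not concatenate back to the token text."""
    # Plain words have an integer id; multi-word tokens have an id range.
    words_by_id = {e["id"]: e["text"] for e in sentence if isinstance(e["id"], int)}
    mismatches = []
    for entry in sentence:
        if isinstance(entry["id"], int):
            continue  # plain word, nothing to compare
        first, last = entry["id"][0], entry["id"][-1]
        joined = "".join(words_by_id[i] for i in range(first, last + 1))
        if joined != entry["text"]:
            mismatches.append((entry["text"], joined))
    return mismatches


print(find_mwt_mismatches(sentence))
# [("receptionist's", "receptionstst's")]
```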

I have collected a few more examples.

  • nlp("I went to Björkängshallen's square.")
{
      "id": [
        4,
        5
      ],
      "text": "Björkängshallen's",
      "start_char": 10,
      "end_char": 27
    },
    {
      "id": 4,
      "text": "Bj<UNK>rkk<UNK>ngsshsn" // if no apostrophe, the token level seems to find correct text
    },
    {
      "id": 5,
      "text": "'s"
    }
  • nlp("It happend at a Stockholm municipality's nursery")
{
      "id": [
        6,
        7
      ],
      "text": "municipality's",
      "start_char": 26,
      "end_char": 40
    },
    {
      "id": 6,
      "text": "municipaltity"  // weird text again
    },
    {
      "id": 7,
      "text": "'s"
    },
  • nlp("My establishment's meaning")
{
      "id": [
        2,
        3
      ],
      "text": "establishment's",
      "start_char": 3,
      "end_char": 18
    },
    {
      "id": 2,
      "text": "estabblismentn" // wrong word text
    },
    {
      "id": 3,
      "text": "'s"
    },
  • nlp("We went to the Drottningholm's festival")
{
      "id": [
        5,
        6
      ],
      "text": "Drottningholm's",
      "start_char": 15,
      "end_char": 30
    },
    {
      "id": 5,
      "text": "Drottniingommm"  // wrong word text
    },
    {
      "id": 6,
      "text": "'s"
    }
  • nlp("The ad on the newspaper's pages")
{
      "id": [
        5,
        6
      ],
      "text": "newspaper's",
      "start_char": 14,
      "end_char": 25
    },
    {
      "id": 5,
      "text": "newspapper" // wrong word text
    },
    {
      "id": 6,
      "text": "'s"
    }

Expected behavior
I think the token receptionist's should be split into the words receptionist and 's at the word level, not receptionstst and 's.
For the sentence "I went to Björkängshallen's square.", the word text is both garbled and contains <UNK>. I would expect this instead:

{
      "id": 4,
      "text": "Björkängshallen" 
    },
    {
      "id": 5,
      "text": "'s"
    }
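The expectation above can be written down as a tiny reference function. This is a sketch only: `expected_words` is a hypothetical helper that encodes what I expect for a trailing possessive 's, and it is not meant to describe how Stanza actually expands tokens.

```python
def expected_words(token_text):
    """Split a trailing possessive 's off a token; leave other tokens whole."""
    if token_text.endswith("'s") and len(token_text) > 2:
        return [token_text[:-2], "'s"]
    return [token_text]


# Tokens taken from the examples in this report.
for token in ["receptionist's", "Björkängshallen's", "municipality's"]:
    print(expected_words(token))
```

The word text is copied straight from the input characters, so nothing is respelled and no <UNK> can appear.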

Environment (please complete the following information):

  • OS: MacOS
  • Python version: 3.11
  • Stanza version: 1.8.2
  • English tokenize model: combined

Additional context
I saw issue #1361. Maybe this is related?
