Contractions don't remove dots #41

@sheerun

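require 'pragmatic_tokenizer'

# Polish tokenizer with a custom contraction map; "os." is the
# abbreviation of "osiedle" and is listed with and without the dot.
tokenizer = PragmaticTokenizer::Tokenizer.new({
  language: :pl,
  numbers: :all,
  downcase: false,
  contractions: { "os" => "osiedle", "os." => "osiedle" },
  expand_contractions: true
})

puts tokenizer.tokenize("Na os.Piłsudskiego")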

The proper tokenization should be

  • ["Na", "osiedle", "Piłsudskiego"]

while the tokenizer returns

  • ["Na", "Osiedle", ".", "Piłsudskiego"]
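
Until this is fixed, one possible stopgap is to expand dotted abbreviations in the raw text before it reaches the tokenizer. The sketch below is plain Ruby and does not touch the gem; `DOTTED_CONTRACTIONS` and `expand_dotted_contractions` are hypothetical names made up for illustration, and it assumes the abbreviation is matched case-sensitively and is not preceded by a letter or digit.

```ruby
# Hypothetical pre-processing helper, NOT part of pragmatic_tokenizer.
# Maps abbreviations that end in a dot to their expanded form.
DOTTED_CONTRACTIONS = { "os." => "osiedle" }.freeze

def expand_dotted_contractions(text, contractions = DOTTED_CONTRACTIONS)
  expanded = contractions.reduce(text) do |result, (abbr, expansion)|
    # Match the abbreviation only when it is not glued to a preceding
    # letter or digit, and swallow any whitespace that follows the dot.
    pattern = /(?<![[:alnum:]])#{Regexp.escape(abbr)}\s*/
    # Block form avoids backslash interpretation in the replacement.
    result.gsub(pattern) { "#{expansion} " }
  end
  # Drop the trailing space left when the abbreviation ends the text.
  expanded.rstrip
end

puts expand_dotted_contractions("Na os.Piłsudskiego")
# prints "Na osiedle Piłsudskiego"
```

The returned string can then be passed to `tokenizer.tokenize` as usual; this only sidesteps the leftover "." token and does not address the unexpected capitalization of "Osiedle".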
