Skip to content

NER pipeline: Inconsistent entity grouping #4816

@dav009

Description

@dav009

🐛 Bug

Information

"mrm8488/bert-spanish-cased-finetuned-ner"

Language I am using the model on (English, Chinese ...): Spanish

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The tasks I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. create a ner pipeline
  2. pass flag grouped_entities
  3. entities are not grouped as expected see sample below
NER_MODEL = "mrm8488/bert-spanish-cased-finetuned-ner"
nlp_ner = pipeline("ner", model=NER_MODEL,
                   grouped_entities=True,
                   tokenizer=(NER_MODEL, {"use_fast": False}))

t = """Consuelo Araújo Noguera, ministra de cultura del presidente Andrés Pastrana (1998.2002) fue asesinada por las Farc luego de haber permanecido secuestrada por algunos meses."""
ner(t)
>>> 
[ {'entity_group': 'B-PER', 'score': 0.901019960641861, 'word': 'Consuelo'}, 
 {'entity_group': 'I-PER', 'score': 0.9990904808044434, 'word': 'Araújo Noguera'}, 
 {'entity_group': 'B-PER', 'score': 0.9998136162757874, 'word': 'Andrés'}, 
 {'entity_group': 'I-PER', 'score': 0.9996985991795858, 'word': 'Pastrana'}, 
 {'entity_group': 'B-ORG', 'score': 0.9989739060401917, 'word': 'Far'}]

Expected behavior

Inconsistent grouping

I expect the first two items of the given sample( B-PER, and I-PER) to be grouped. As they are contiguous tokens and correspond to a single entity spot. It seems the current code does not take into account B and I tokens.

expected output:

 {'entity_group': 'I-PER', 'score': 0.9990904808044434, 'word': ' Consuelo Araújo Noguera'}, 
 {'entity_group': 'I-PER', 'score': 0.9998136162757874, 'word': 'Andrés Pastrana'}, 
 {'entity_group': 'B-ORG', 'score': 0.9989739060401917, 'word': 'Farc'}]

Lost tokens?

for the same input, passing grouped_entities=False generates the following output:

[
{'word': 'Cons', 'score': 0.9994944930076599, 'entity': 'B-PER', 'index': 1},
{'word': '##uelo', 'score': 0.802545428276062, 'entity': 'B-PER', 'index': 2}, 
{'word': 'Ara', 'score': 0.9993102550506592, 'entity': 'I-PER', 'index': 3}, 
{'word': '##új', 'score': 0.9993743896484375, 'entity': 'I-PER', 'index': 4}, 
{'word': '##o', 'score': 0.9992871880531311, 'entity': 'I-PER', 'index': 5}, 
{'word': 'No', 'score': 0.9993029236793518, 'entity': 'I-PER', 'index': 6}, 
{'word': '##guera', 'score': 0.9981776475906372, 'entity': 'I-PER', 'index': 7}, 
{'word': 'Andrés', 'score': 0.9998136162757874, 'entity': 'B-PER', 'index': 15}, 
{'word': 'Pas', 'score': 0.999740719795227, 'entity': 'I-PER', 'index': 16}, 
{'word': '##tran', 'score': 0.9997414350509644, 'entity': 'I-PER', 'index': 17}, 
{'word': '##a', 'score': 0.9996136426925659, 'entity': 'I-PER', 'index': 18}, 
{'word': 'Far', 'score': 0.9989739060401917, 'entity': 'B-ORG', 'index': 28}, 
{'word': '##c', 'score': 0.7188423275947571, 'entity': 'I-ORG', 'index': 29}]

when using grouped_entities the last entity word (##c) got lost, it is not even considered as a different group

{'entity_group': 'B-ORG', 'score': 0.9989739060401917, 'word': 'Far'}]

Environment info

  • transformers version: 2.11.0
  • Platform: OSX
  • Python version: 3.7
  • PyTorch version (GPU?): 1.5.0
  • Tensorflow version (GPU?):
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions