-
Notifications
You must be signed in to change notification settings - Fork 31.2k
Closed
Labels
Description
🐛 Bug
Information
"mrm8488/bert-spanish-cased-finetuned-ner"
Language I am using the model on (English, Chinese ...): Spanish
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The tasks I am working on is:
- an official GLUE/SQUaD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
- create a
nerpipeline - pass flag
grouped_entities - entities are not grouped as expected see sample below
NER_MODEL = "mrm8488/bert-spanish-cased-finetuned-ner"
nlp_ner = pipeline("ner", model=NER_MODEL,
grouped_entities=True,
tokenizer=(NER_MODEL, {"use_fast": False}))
t = """Consuelo Araújo Noguera, ministra de cultura del presidente Andrés Pastrana (1998.2002) fue asesinada por las Farc luego de haber permanecido secuestrada por algunos meses."""
ner(t)
>>>
[ {'entity_group': 'B-PER', 'score': 0.901019960641861, 'word': 'Consuelo'},
{'entity_group': 'I-PER', 'score': 0.9990904808044434, 'word': 'Araújo Noguera'},
{'entity_group': 'B-PER', 'score': 0.9998136162757874, 'word': 'Andrés'},
{'entity_group': 'I-PER', 'score': 0.9996985991795858, 'word': 'Pastrana'},
{'entity_group': 'B-ORG', 'score': 0.9989739060401917, 'word': 'Far'}]Expected behavior
Inconsistent grouping
I expect the first two items of the given sample( B-PER, and I-PER) to be grouped. As they are contiguous tokens and correspond to a single entity spot. It seems the current code does not take into account B and I tokens.
expected output:
{'entity_group': 'I-PER', 'score': 0.9990904808044434, 'word': ' Consuelo Araújo Noguera'},
{'entity_group': 'I-PER', 'score': 0.9998136162757874, 'word': 'Andrés Pastrana'},
{'entity_group': 'B-ORG', 'score': 0.9989739060401917, 'word': 'Farc'}]
Lost tokens?
for the same input, passing grouped_entities=False generates the following output:
[
{'word': 'Cons', 'score': 0.9994944930076599, 'entity': 'B-PER', 'index': 1},
{'word': '##uelo', 'score': 0.802545428276062, 'entity': 'B-PER', 'index': 2},
{'word': 'Ara', 'score': 0.9993102550506592, 'entity': 'I-PER', 'index': 3},
{'word': '##új', 'score': 0.9993743896484375, 'entity': 'I-PER', 'index': 4},
{'word': '##o', 'score': 0.9992871880531311, 'entity': 'I-PER', 'index': 5},
{'word': 'No', 'score': 0.9993029236793518, 'entity': 'I-PER', 'index': 6},
{'word': '##guera', 'score': 0.9981776475906372, 'entity': 'I-PER', 'index': 7},
{'word': 'Andrés', 'score': 0.9998136162757874, 'entity': 'B-PER', 'index': 15},
{'word': 'Pas', 'score': 0.999740719795227, 'entity': 'I-PER', 'index': 16},
{'word': '##tran', 'score': 0.9997414350509644, 'entity': 'I-PER', 'index': 17},
{'word': '##a', 'score': 0.9996136426925659, 'entity': 'I-PER', 'index': 18},
{'word': 'Far', 'score': 0.9989739060401917, 'entity': 'B-ORG', 'index': 28},
{'word': '##c', 'score': 0.7188423275947571, 'entity': 'I-ORG', 'index': 29}]
when using grouped_entities the last entity word (##c) got lost, it is not even considered as a different group
{'entity_group': 'B-ORG', 'score': 0.9989739060401917, 'word': 'Far'}]
Environment info
transformersversion: 2.11.0- Platform: OSX
- Python version: 3.7
- PyTorch version (GPU?): 1.5.0
- Tensorflow version (GPU?):
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no