Closed
Description
Describe the bug
Tokens that are not followed by a space in the original text are missing the "SpaceAfter=No" annotation, both in the misc field of the Word object and in the CoNLL-U output.
To Reproduce
import stanza
from stanza.utils.conll import CoNLL
text = """
A bird hit the car, apparently.
"""
nlp = stanza.Pipeline(lang='en', processors='tokenize,mwt,pos,lemma,depparse', package='ewt')
doc = nlp(text)
for sent in doc.sentences:
    print(*[f'id: {word.id}\tword: {word.text}\tmisc: {word.misc}' for word in sent.words], sep='\n')
print('\n')
print(CoNLL.doc2conll_text(doc))
Outputs the following:
id: 1 word: A misc: None
id: 2 word: bird misc: None
id: 3 word: hit misc: None
id: 4 word: the misc: None
id: 5 word: car misc: None
id: 6 word: , misc: None
id: 7 word: apparently misc: None
id: 8 word: . misc: None
# text = A bird hit the car, apparently.
# sent_id = 0
1 A a DET DT Definite=Ind|PronType=Art 2 det _ start_char=1|end_char=2
2 bird bird NOUN NN Number=Sing 3 nsubj _ start_char=3|end_char=7
3 hit hit VERB VBD Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 0 root _ start_char=8|end_char=11
4 the the DET DT Definite=Def|PronType=Art 5 det _ start_char=12|end_char=15
5 car car NOUN NN Number=Sing 3 obj _ start_char=16|end_char=19
6 , , PUNCT , _ 3 punct _ start_char=19|end_char=20
7 apparently apparently ADV RB _ 3 advmod _ start_char=21|end_char=31
8 . . PUNCT . _ 3 punct _ start_char=31|end_char=32
Expected behavior
In the example sentence, tokens "car" and "apparently" should have "SpaceAfter=No" in the misc field.
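For reference, the expected flags can be worked out from the start_char/end_char offsets already present in the output above: a token has no following space when its end_char equals the next token's start_char. The following is only a minimal sketch of that check, not Stanza's implementation; `space_after_no` is a hypothetical helper name and the token tuples are copied by hand from the CoNLL-U output.

```python
def space_after_no(tokens):
    """Return forms of tokens whose end offset touches the next token's start.

    `tokens` is a list of (form, start_char, end_char) tuples.
    Hypothetical helper for illustration; not part of the Stanza API.
    """
    flagged = []
    # Pair each token with its successor; the last token has no successor here.
    for (form, _start, end), (_nform, next_start, _nend) in zip(tokens, tokens[1:]):
        if end == next_start:
            flagged.append(form)
    return flagged


# (form, start_char, end_char), copied from the MISC column in the output above
tokens = [
    ("A", 1, 2), ("bird", 3, 7), ("hit", 8, 11), ("the", 12, 15),
    ("car", 16, 19), (",", 19, 20), ("apparently", 21, 31), (".", 31, 32),
]

print(space_after_no(tokens))  # ['car', 'apparently']
```

This matches the two tokens named above. The final "." is not flagged because the sketch only compares adjacent tokens, and in the example text it is followed by a newline.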
Environment (please complete the following information):
- OS: Windows
- Python version: 3.1
- Stanza version: 1.6.1
Additional context
A comment on #677 indicates that "SpaceAfter=No" is included in the CoNLL-U output format: #677 (comment)