"SpaceAfter=No" not being included in misc field of Word objects #1315

@tomlup

Describe the bug
Tokens that are not followed by a space in the original text do not record that fact in the misc field of the Word object or in the CoNLL-U output.

To Reproduce

import stanza
from stanza.utils.conll import CoNLL

# "car" and "apparently" are each followed directly by punctuation, with no space.
text = """
A bird hit the car, apparently.
"""

nlp = stanza.Pipeline(lang='en', processors='tokenize,mwt,pos,lemma,depparse', package='ewt')
doc = nlp(text)

# Print the misc field of every Word, one per line.
for sent in doc.sentences:
    for word in sent.words:
        print(f'id: {word.id}\tword: {word.text}\tmisc: {word.misc}')
    print('\n')

# Print the same document in CoNLL-U format.
print(CoNLL.doc2conll_text(doc))

Outputs the following:

id: 1	word: A	misc: None
id: 2	word: bird	misc: None
id: 3	word: hit	misc: None
id: 4	word: the	misc: None
id: 5	word: car	misc: None
id: 6	word: ,	misc: None
id: 7	word: apparently	misc: None
id: 8	word: .	misc: None


# text = A bird hit the car, apparently.
# sent_id = 0
1	A	a	DET	DT	Definite=Ind|PronType=Art	2	det	_	start_char=1|end_char=2
2	bird	bird	NOUN	NN	Number=Sing	3	nsubj	_	start_char=3|end_char=7
3	hit	hit	VERB	VBD	Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin	0	root	_	start_char=8|end_char=11
4	the	the	DET	DT	Definite=Def|PronType=Art	5	det	_	start_char=12|end_char=15
5	car	car	NOUN	NN	Number=Sing	3	obj	_	start_char=16|end_char=19
6	,	,	PUNCT	,	_	3	punct	_	start_char=19|end_char=20
7	apparently	apparently	ADV	RB	_	3	advmod	_	start_char=21|end_char=31
8	.	.	PUNCT	.	_	3	punct	_	start_char=31|end_char=32

Expected behavior
In the example sentence, tokens "car" and "apparently" should have "SpaceAfter=No" in the misc field.
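
For reference, the expected flags can be derived from the start_char/end_char offsets that already appear in the output above. The following is only a sketch of that check, not Stanza's implementation: `space_after_flags` is a hypothetical helper, and the token tuples are copied by hand from the CoNLL-U output rather than read from the pipeline.

```python
from typing import List, Optional, Tuple

# (text, start_char, end_char), copied from the CoNLL-U output above.
Token = Tuple[str, int, int]


def space_after_flags(tokens: List[Token], text: str) -> List[Optional[str]]:
    """Hypothetical helper: return "SpaceAfter=No" for each token whose
    end offset is immediately followed by a non-whitespace character."""
    flags: List[Optional[str]] = []
    for _, _, end in tokens:
        # end_char is exclusive, so text[end] is the character after the token.
        no_space = end < len(text) and not text[end].isspace()
        flags.append("SpaceAfter=No" if no_space else None)
    return flags


text = "\nA bird hit the car, apparently.\n"
tokens = [
    ("A", 1, 2), ("bird", 3, 7), ("hit", 8, 11), ("the", 12, 15),
    ("car", 16, 19), (",", 19, 20), ("apparently", 21, 31), (".", 31, 32),
]

flags = space_after_flags(tokens, text)
for (word, _, _), flag in zip(tokens, flags):
    print(f"word: {word}\tmisc: {flag}")
```

With these offsets, only "car" and "apparently" are flagged, which matches the expectation stated above. The final "." is followed by a newline, so it is not flagged.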

Environment (please complete the following information):

  • OS: Windows
  • Python version: 3.1
  • Stanza version: 1.6.1

Additional context
A comment on an earlier issue indicates that "SpaceAfter=No" is included in the CoNLL-U output format: #677 (comment)
