Potentially lowercase the data in a lemmatizer if all of the training… #1331
Merged
Parent:
Update tomli version requirement
@@ -0,0 +1,57 @@

```python
import pytest

from stanza.models.lemmatizer import all_lowercase
from stanza.utils.conll import CoNLL

LATIN_CONLLU = """
# sent_id = train-s1
# text = unde et philosophus dicit felicitatem esse operationem perfectam.
# reference = ittb-scg-s4203
1 unde unde ADV O4 AdvType=Loc|PronType=Rel 4 advmod:lmod _ _
2 et et CCONJ O4 _ 3 advmod:emph _ _
3 philosophus philosophus NOUN B1|grn1|casA|gen1 Case=Nom|Gender=Masc|InflClass=IndEurO|Number=Sing 4 nsubj _ _
4 dicit dico VERB N3|modA|tem1|gen6 Aspect=Imp|InflClass=LatX|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act 0 root _ TraditionalMood=Indicativus|TraditionalTense=Praesens
5 felicitatem felicitas NOUN C1|grn1|casD|gen2 Case=Acc|Gender=Fem|InflClass=IndEurX|Number=Sing 7 nsubj _ _
6 esse sum AUX N3|modH|tem1 Aspect=Imp|Tense=Pres|VerbForm=Inf 7 cop _ _
7 operationem operatio NOUN C1|grn1|casD|gen2|vgr1 Case=Acc|Gender=Fem|InflClass=IndEurX|Number=Sing 4 ccomp _ _
8 perfectam perfectus ADJ A1|grn1|casD|gen2 Case=Acc|Gender=Fem|InflClass=IndEurA|Number=Sing 7 amod _ SpaceAfter=No
9 . . PUNCT Punc _ 4 punct _ _

# sent_id = train-s2
# text = perfectio autem operationis dependet ex quatuor.
# reference = ittb-scg-s4204
1 perfectio perfectio NOUN C1|grn1|casA|gen2 Case=Nom|Gender=Fem|InflClass=IndEurX|Number=Sing 4 nsubj _ _
2 autem autem PART O4 _ 4 discourse _ _
3 operationis operatio NOUN C1|grn1|casB|gen2|vgr1 Case=Gen|Gender=Fem|InflClass=IndEurX|Number=Sing 1 nmod _ _
4 dependet dependeo VERB K3|modA|tem1|gen6 Aspect=Imp|InflClass=LatE|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act 0 root _ TraditionalMood=Indicativus|TraditionalTense=Praesens
5 ex ex ADP S4|vgr2 _ 6 case _ _
6 quatuor quattuor NUM G1|gen3|vgr1 NumForm=Word|NumType=Card 4 obl:arg _ SpaceAfter=No
7 . . PUNCT Punc _ 4 punct _ _
""".lstrip()

ENG_CONLLU = """
# sent_id = weblog-juancole.com_juancole_20051126063000_ENG_20051126_063000-0007
# text = You wonder if he was manipulating the market with his bombing targets.
1 You you PRON PRP Case=Nom|Person=2|PronType=Prs 2 nsubj 2:nsubj _
2 wonder wonder VERB VBP Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin 0 root 0:root _
3 if if SCONJ IN _ 6 mark 6:mark _
4 he he PRON PRP Case=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs 6 nsubj 6:nsubj _
5 was be AUX VBD Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 6 aux 6:aux _
6 manipulating manipulate VERB VBG Tense=Pres|VerbForm=Part 2 ccomp 2:ccomp _
7 the the DET DT Definite=Def|PronType=Art 8 det 8:det _
8 market market NOUN NN Number=Sing 6 obj 6:obj _
9 with with ADP IN _ 12 case 12:case _
10 his his PRON PRP$ Case=Gen|Gender=Masc|Number=Sing|Person=3|Poss=Yes|PronType=Prs 12 nmod:poss 12:nmod:poss _
11 bombing bombing NOUN NN Number=Sing 12 compound 12:compound _
12 targets target NOUN NNS Number=Plur 6 obl 6:obl:with SpaceAfter=No
13 . . PUNCT . _ 2 punct 2:punct _
""".lstrip()


def test_all_lowercase():
    doc = CoNLL.conll2doc(input_str=LATIN_CONLLU)
    assert all_lowercase(doc)

def test_not_all_lowercase():
    doc = CoNLL.conll2doc(input_str=ENG_CONLLU)
    assert not all_lowercase(doc)
```
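The body of `all_lowercase` is not part of this file, so the property the two tests check may be easier to follow with a small stand-in. The sketch below is not the PR's implementation: it assumes the check amounts to "no word form changes under `str.lower()`", and it works on plain lists of strings instead of a stanza `Document`. The name `words_all_lowercase` is hypothetical.

```python
def words_all_lowercase(sentences):
    """Return True if every word form is unchanged by str.lower().

    `sentences` is a list of sentences, each a list of word strings.
    Hypothetical stand-in for illustration; not stanza's all_lowercase.
    """
    return all(word == word.lower() for sentence in sentences for word in sentence)


# Word forms taken from the Latin and English fixtures in the test file.
latin = [["unde", "et", "philosophus", "dicit", "."], ["perfectio", "autem", "."]]
english = [["You", "wonder", "if", "he", "was", "."]]

print(words_all_lowercase(latin))    # True
print(words_all_lowercase(english))  # False: "You" contains an uppercase letter
```

Punctuation and digits are unchanged by `str.lower()`, so under this assumption they never cause a dataset to count as cased.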
Do the published models already have this flag, and it just poofed? Otherwise, don't we need to release new models with this?
LGTM otherwise.
I redid the Latin ITTB model with this flag. It's the only one so far, but all of the others should automatically pick it up if their entire dataset is lowercased. I do wonder whether that will be surprising in some cases, though: for example, a Chinese model that accidentally gets used on some English text.
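To make the concern in the last comment concrete, here is a toy sketch of how a "lowercase if the training data was all lowercase" flag could behave. Everything in it is hypothetical (`ToyLemmatizer`, `detect_caseless`, the lookup-table lemmatizer); it is not stanza's lemmatizer, only an illustration under the assumption that the flag is derived from the training data and applied to input at inference time.

```python
def detect_caseless(words):
    """True if no training word changes under str.lower() (hypothetical helper)."""
    return all(word == word.lower() for word in words)


class ToyLemmatizer:
    """Dictionary-lookup lemmatizer that remembers whether its data was caseless."""

    def __init__(self, training_pairs):
        # The flag is derived from the training data, as discussed in the thread.
        self.lowercase = detect_caseless(word for word, _ in training_pairs)
        self.table = dict(training_pairs)

    def lemmatize(self, word):
        # Lowercase the input only if the training data never contained uppercase.
        if self.lowercase:
            word = word.lower()
        # Unknown words fall back to the (possibly lowercased) surface form.
        return self.table.get(word, word)


# Chinese has no letter case, so the flag is set even though nobody lowercased anything.
zh = ToyLemmatizer([("我们", "我们"), ("书", "书")])
en = ToyLemmatizer([("You", "you"), ("was", "be")])

print(zh.lowercase, zh.lemmatize("Apple"))  # True apple
print(en.lowercase, en.lemmatize("Apple"))  # False Apple
```

In this toy setup the Chinese-trained model silently lowercases English input, which is the potentially surprising behavior raised above; whether that matters in practice depends on how the real model handles out-of-language text.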