AngledLuffa
Collaborator

Lowercase the input to a lemmatizer if all of its training data is lowercase or a user flag requests it

Testing additions:

- Add a basic unit test of the all_lowercase function
- Add a test of the caseless lemmatizer in the Pipeline
- Test that the Latin ITTB lemmatizer is marked as caseless, and check that the results for capitalized text are as expected

Addresses #1330
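
As a rough illustration of the behavior described above, here is a minimal sketch. It is not the code in this PR: `ToyLemmatizer`, its constructor arguments, and the body of `all_lowercase` are hypothetical stand-ins, and the real lemmatizer is more than a dictionary lookup. The sketch only shows the decision rule (caseless if the user asks for it, or if no training word changes when lowercased) and the effect on capitalized input.

```python
from typing import Dict, Iterable, List, Tuple


def all_lowercase(words: Iterable[str]) -> bool:
    """Hypothetical check: True if no word changes when lowercased."""
    return all(word == word.lower() for word in words)


class ToyLemmatizer:
    """Illustrative dictionary lemmatizer; not Stanza's implementation."""

    def __init__(self, train_pairs: List[Tuple[str, str]], caseless: bool = False):
        words = [word for word, _ in train_pairs]
        # Caseless mode is on if the user requested it or the data is all lowercase.
        self.caseless = caseless or all_lowercase(words)
        # In caseless mode, store lowercased keys so lookups are consistent.
        self.lemmas: Dict[str, str] = {
            (word.lower() if self.caseless else word): lemma
            for word, lemma in train_pairs
        }

    def lemmatize(self, word: str) -> str:
        key = word.lower() if self.caseless else word
        # Unseen words fall back to the (possibly lowercased) surface form.
        return self.lemmas.get(key, key)


lemmatizer = ToyLemmatizer([("puella", "puella"), ("amat", "amo")])
print(lemmatizer.caseless)           # True: every training word is lowercase
print(lemmatizer.lemmatize("AMAT"))  # amo
```

Without the caseless flag, `"AMAT"` would miss the lookup in this toy model because only `"amat"` was seen in training.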

@AngledLuffa AngledLuffa force-pushed the dev branch 2 times, most recently from cd3eeeb to 20a12c0 Compare January 19, 2024 06:27
"""
nlp = stanza.Pipeline('la', package='ittb', processors='tokenize,pos,lemma', model_dir=TEST_MODELS_DIR, download_method=None)
lemmatizer = nlp.processors['lemma']
assert lemmatizer.config['caseless']
Member

Do the published models have this flag already, and it just poofed? Otherwise, don't we need to release new models with this?

LGTM otherwise.

Collaborator Author

I redid the Latin ITTB model with this flag. It's the only one so far, but all of the others should pick it up automatically if their entire dataset is lowercase. I do wonder whether that will be surprising if, for example, a Chinese model accidentally gets used on some English text.
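
To make the concern above concrete, here is a small sketch. It assumes the check is of the form "no word changes when lowercased", which may not be exactly how this PR implements it. Both function names are hypothetical. Under that assumption, text in a script with no letter case passes trivially, and a stricter variant could also require at least one cased character.

```python
def unchanged_by_lowercasing(words):
    """Assumed form of the check: lowercasing leaves every word as-is."""
    return all(word == word.lower() for word in words)


def lowercase_with_cased_chars(words):
    """Stricter hypothetical variant: also require some cased character."""
    words = list(words)
    # A word contains a cased character if its lower and upper forms differ.
    has_cased = any(word.lower() != word.upper() for word in words)
    return has_cased and unchanged_by_lowercasing(words)


chinese_words = ["我", "喜欢", "猫"]
latin_words = ["puella", "amat"]

print(unchanged_by_lowercasing(chinese_words))    # True: nothing to lowercase
print(lowercase_with_cased_chars(chinese_words))  # False: no cased characters
print(lowercase_with_cased_chars(latin_words))    # True
```

Whether the stricter variant is desirable depends on how such models should treat stray cased text, which is the open question in the comment above.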

@AngledLuffa AngledLuffa merged commit 3aeea46 into dev Feb 3, 2024
@AngledLuffa AngledLuffa deleted the lowercase_lemmas branch February 3, 2024 03:35