Polish Stemmer v3 #245
7feb805
to
fb0887a
Compare
Thanks.
Also different for UTF-8 vs wide characters.
My immediate thought was Czech, which has syllabic consonants, but according to https://en.wikipedia.org/wiki/Polish_language#Consonant_distribution :

> Unlike languages such as Czech, Polish does not have syllabic consonants – the nucleus of a syllable is always a vowel.

Maybe it's worth trying to think about whether there are rules about these consonant clusters we can use though.
If you want the cursor two Unicode characters in, then
Are you going to be at debcamp/debconf next month?
I don't think there is a law here.
Thank you, I will give it a try. Number 2 was chosen quite arbitrarily, maybe I should also play with 1 and 3 and look at the results.
Unfortunately, no…
Fair enough. I noticed Wikipedia says "Polish can have word-initial and word-medial clusters of up to four consonants" but that's probably not helpful unless it's possible to have five or more initial consonants and we should break after four in those cases.
Producing single letter stems, even if linguistically correct, can be problematic - there just aren't very many single letters so unwanted conflation is a bigger risk - e.g. consider discussion of "the letter b", musical notes: "in the key of B minor", chemical elements: "B is the symbol for Boron", itemisation: "In case b, we see [...]", "blood type: B", "vitamin B", "Hepatitis B", people's initials: "B. Wilson", etc.
There's of course unwanted conflation in search results between all of these already, but stemming forms of być to b will likely mean searching for them produces a lot more noise and the benefits of conflating the different forms will tend to be significantly reduced if not wiped out entirely. So it might be better to just not conflate forms of such words, or if possible map them to a longer stem. Grouping some forms together on to a longer stem (or onto several longer stems) could also be an option (probably terrible example just to illustrate what I mean: for być, conflating byłem, byłam, byłom, byłyśmy, etc on stem był).
fb0887a
to
6c2d03a
Compare
I implemented this now, and removed my FIXME comment.
algorithms/polish.sbl
Outdated
// so we need to process it after we are done with adjectival forms.
setlimit tomark p0 for ([substring]) among (
    'sz{ek}' // present 1st person singular (noszę)
    'sz{ak}' // present 3rd person plural (noszą)
As far as I can see we can never match `'sz{ak}'` here because it's also handled in `adjectival` (with action `delete` instead of `<- 's'`). Testing the example word here, it gets stemmed to `no` not `nos`.
From a quick test, it looks like always doing `<- 's'` for `'sz{ak}'` is worse than always doing `delete`.
Based on the example words in the comments (R1 starts at the `|` in `nos|zą` and `lep|szą`), I tried leaving the `s` in place for shorter words:
'sz{ak}'
(R1 and delete or <-'s')
That fixes noszą to stem to nos as the comment suggests it should, but I couldn't tell if that was better overall (it's at least not obviously worse than always removing it) - might be worth you taking a look at.
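To make the effect of `(R1 and delete or <-'s')` easier to follow, here is a minimal Python sketch that models only this one case, not the rest of the stemmer. The vowel set, the helper names `r1_start` and `strip_sza`, and the `use_r1_test` switch are assumptions made for illustration; they are not taken from `polish.sbl`.

```python
# Assumed vowel set for illustration; the real grouping lives in polish.sbl.
POLISH_VOWELS = set("aąeęioóuy")


def r1_start(word: str) -> int:
    """Index where R1 begins: just after the first non-vowel that follows a vowel."""
    for i in range(1, len(word)):
        if word[i] not in POLISH_VOWELS and word[i - 1] in POLISH_VOWELS:
            return i + 1
    return len(word)  # no such position, so R1 is empty


def strip_sza(word: str, use_r1_test: bool = True) -> str:
    """Model the 'sz{ak}' case only: delete the suffix, or replace it with 's'."""
    suffix = "szą"
    if not word.endswith(suffix):
        return word
    cut = len(word) - len(suffix)  # where the matched suffix starts
    if not use_r1_test or cut >= r1_start(word):
        return word[:cut]          # suffix lies in R1 (or test disabled): delete
    return word[:cut] + "s"        # suffix starts before R1: <-'s'
```

With the R1 test, `strip_sza("noszą")` keeps the `s` while `strip_sza("lepszą")` drops the whole suffix; passing `use_r1_test=False` models the always-delete behaviour.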
> I tried leaving the s in place for shorter words:

Both adjectives and verbs can be short and long. But I decided to check your heuristic that short words are most probably verbs. I took words matching `^[a-z]{,3}szą$` from the sample vocabulary:
- cieszą — verb
- dalszą — adjective
- duszą — noun
- gorszą — adjective
- gruszą — noun
- kuszą — noun
- lepszą — adjective
- mszą — noun
- muszą — verb
- naszą — pronoun (declines as an adjective)
- noszą — verb
- nowszą — adjective
- pieszą — adjective
- piszą — verb
- proszą — verb
- ruszą — verb
- skuszą — verb
- słyszą — verb
- suszą — noun
- unoszą — verb
- waszą — pronoun (declines as an adjective)
- wnoszą — verb
- zmuszą — verb
- znoszą — verb
(Nouns in this list have sz as part of their root, not as a suffix.)
So, indeed, most of the short forms in -szą will be verbs. But with any such heuristic there will be many false positives.
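For anyone wanting to repeat this kind of check, here is a small Python sketch of the filtering step. It is an illustration rather than the command actually used: in Python `[a-z]` is ASCII-only and would skip words such as słyszą, so the sketch swaps in the Unicode-aware letter class `[^\W\d_]`. The names `SHORT_SZA` and `short_sza_forms` are hypothetical.

```python
import re

# Unicode-aware variant of ^[a-z]{,3}szą$ (assumption: any letter counts,
# including Polish letters such as ł that [a-z] would not match in Python).
SHORT_SZA = re.compile(r"^[^\W\d_]{,3}szą$")


def short_sza_forms(words):
    """Return the words with at most three letters before the ending -szą."""
    return [w for w in words if SHORT_SZA.match(w)]
```

The function takes any iterable of words, for example the lines of the sample vocabulary with trailing newlines stripped.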
What I can say for sure is that forms starting with naj- and having -sz- + adjectival ending on the end are definitely superlative forms of adjectives and should not be treated as verbs. But my level of Snowball knowledge does not allow me to express this rule (I made some attempts earlier but failed).
And a short review of changes after applying the suggested change `(R1 and delete or <-'s')` on top of your branch:
source | old | new | part of speech | comment |
---|---|---|---|---|
cieszą | cie | cies | verb | looks good |
duszą | du | dus | noun | ideally should be dusz, but better than before |
gruszą | gru | grus | noun | ideally should be grusz, but better than before |
kuszą | ku | kus | noun | ideally should be kusz, but better than before |
muszą | mu | mus | verb | looks good |
naszą | na | nas | pronoun | ideally should be nasz, but better than before |
noszą | no | nos | verb | looks good |
pieszą | pie | pies | adjective | looks good (not a comparative form, sz is part of the root) |
piszą | pi | pis | verb | looks good |
proszą | pro | pros | verb | looks good |
ruszą | ru | rus | verb | ideally should be rusz, but better than before |
skuszą | sku | skus | verb | looks good |
słyszą | sły | słys | verb | ideally should be słysz, but better than before |
suszą | su | sus | noun or verb | ideally should be susz (in both cases), but better than before |
waszą | wa | was | pronoun | ideally should be wasz, but better than before |
wnoszą | wno | wnos | verb | looks good |
zmuszą | zmu | zmus | verb | looks good |
znoszą | zno | znos | verb | looks good |
I don’t see any downsides of this change.
> What I can say for sure is that forms starting with naj- and having -sz- + adjectival ending on the end are definitely superlative forms of adjectives and should not be treated as verbs. But my level of snowball knowledge does not allow me to express this rule

I think you're wanting `reverse`, which swaps the cursor and the limit and flips the direction. Untested, but maybe something like:
'sz{ak}'
((R1 or reverse 'naj') and delete or <-'s')
Not 100% sure I've inserted that in the right place, but hopefully it shows how you'd use it.
Oh, except that we set the backwards limit to be two characters from the start of the word at the top level, so with `reverse` we will test for `naj` there, not at the start of the word.
I'll have a think about how best to do this.
I was taking a look at this but there don't seem to be any words in `polish/voc.txt` which start naj and reach this rule.
And thinking about it, the R1 definition means that for any word starting naj, R1 will start after the j so testing for naj here is redundant. Were you actually suggesting testing for naj somewhere else?
Marking R1 for words from your `^[a-z]{,3}szą$` list:
- cies|zą — verb
- dal|szą — adjective
- dus|zą — noun
- gor|szą — adjective
- grus|zą — noun
- kus|zą — noun
- lep|szą — adjective
- mszą| — noun
- mus|zą — verb
- nas|zą — pronoun (declines as an adjective)
- nos|zą — verb
- now|szą — adjective
- pies|zą — adjective
- pis|zą — verb
- pros|zą — verb
- rus|zą — verb
- skus|zą — verb
- słys|zą — verb
- sus|zą — noun
- un|oszą — verb
- was|zą — pronoun (declines as an adjective)
- wnos|zą — verb
- zmus|zą — verb
- znos|zą — verb
Ideally we delete -szą for an adjective and replace with -s for a verb.
I notice that most of the adjectives have -szą in R1 because they start consonant vowel consonant - e.g. dal|szą. The only exception seems to be pies|zą, so in general the current R1 test is very good there.
It's similarly good for verbs - only un|oszą seems to be mishandled.
If was|zą declines as an adjective then it's handled wrongly, but handling of pronouns is unlikely to be important unless they get conflated with other unrelated words.
IIUC nouns should be handled like verbs (replace with -s), and all the nouns here are.
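The R1 markers in the list above can be reproduced with a short Python sketch. It assumes the usual Snowball-style definition (R1 begins after the first non-vowel that follows a vowel) and an assumed Polish vowel set; both are illustrative and may differ in detail from what `polish.sbl` defines. `mark_r1` is a hypothetical helper name.

```python
# Assumed vowel set for illustration; check polish.sbl for the real grouping.
POLISH_VOWELS = set("aąeęioóuy")


def mark_r1(word: str) -> str:
    """Insert '|' where R1 starts: after the first non-vowel following a vowel."""
    for i in range(1, len(word)):
        if word[i] not in POLISH_VOWELS and word[i - 1] in POLISH_VOWELS:
            return word[: i + 1] + "|" + word[i + 1 :]
    return word + "|"  # no vowel followed by a non-vowel: R1 is empty
```

Under these assumptions any word beginning with naj gets its marker right after the j, which is why an extra test for naj at this point adds nothing.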
> I was taking a look at this but there don't seem to be any words in `polish/voc.txt` which start naj and reach this rule.

There are some words in the vocabulary that match the `^naj.*sz.*` regex.
I looked at the output and most of them are stemmed correctly:
polish/voc.txt:najpiękniej
polish/voc.txt:najpiękniejsza
polish/voc.txt:najpiękniejszą
polish/voc.txt:najpiękniejsze
polish/voc.txt:najpiękniejszego
polish/voc.txt:najpiękniejszej
polish/voc.txt:najpiękniejszy
polish/voc.txt:najpiękniejszych
polish/voc.txt:najpiękniejszym
polish/output.txt:najpiękn
polish/output.txt:najpiękn
polish/output.txt:najpiękn
polish/output.txt:najpiękn
polish/output.txt:najpiękn
polish/output.txt:najpiękn
polish/output.txt:najpiękn
polish/output.txt:najpiękn
polish/output.txt:najpiękn
or
polish/voc.txt:najbliżej
polish/voc.txt:najbliżsi
polish/voc.txt:najbliższa
polish/voc.txt:najbliższą
polish/voc.txt:najbliższe
polish/voc.txt:najbliższego
polish/voc.txt:najbliższej
polish/voc.txt:najbliższy
polish/voc.txt:najbliższych
polish/voc.txt:najbliższym
polish/voc.txt:najbliższymi
polish/output.txt:najbliż
polish/output.txt:najbliżs
polish/output.txt:najbliż
polish/output.txt:najbliż
polish/output.txt:najbliż
polish/output.txt:najbliż
polish/output.txt:najbliż
polish/output.txt:najbliż
polish/output.txt:najbliż
polish/output.txt:najbliż
polish/output.txt:najbliż
One thing that catches the eye is that we don't handle plural nominative forms where -szy turns into -si, but that is just one form and I think we can ignore it (it will be hard to fix it properly without false positives).
But again, I think we are doing very well here and no special handling of `naj-` is needed. Were there any other issues holding up this PR?
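The paired listings above can be produced mechanically. The following Python sketch assumes that `polish/voc.txt` and `polish/output.txt` are aligned line by line (word on line n, its stem on line n); the function names and the prefix pattern used in the usage note are hypothetical.

```python
import re
from collections import Counter


def stems_by_word(voc_lines, output_lines, pattern):
    """Pair each vocabulary word matching `pattern` with the stem on the same line."""
    rx = re.compile(pattern)
    return {w: s for w, s in zip(voc_lines, output_lines) if rx.match(w)}


def outliers(pairs):
    """Return the words whose stem differs from the most common stem in the group."""
    if not pairs:
        return {}
    common, _ = Counter(pairs.values()).most_common(1)[0]
    return {w: s for w, s in pairs.items() if s != common}
```

Calling `outliers(stems_by_word(voc, out, r"^najbliż"))` on the words listed above would single out najbliżsi, the nominative plural form mentioned in the comment.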
If you're happy with the naj- situation, I think the only thing is the website update - we have snowballstem/snowball-website#29 but that's for tomek-ai's version.
I feel that transcribing the algorithm from Snowball to English prose is not all that useful. Perhaps it can help find unintended mistakes in the Snowball code, but mostly it gives us two different descriptions of the algorithm, which can then disagree.
What has proved useful is notes about design decisions such as what you chose to do and why (and what you chose not to do and why not).
Okay, I will try to do it, probably next week.
I wondered if we could actually handle most of the ending removal in one big among, which should be faster - currently
This refactor reduced the time taken to stem the vocab list by ~26% for C compared to the version here, while producing the same stems as my initial tweaked version:
To merge the noun forms into the same among it seems we have to use an among function to achieve only considering them in R1 (hence the
Doing this also helps reveal cases which can't match because they're dominated by cases in earlier checks (I've noted the two I identified in separate comments above).
Your modifications look good to me, thank you! I merged them into my branch, and made two commits on top of them.
I've pushed a trivial fix for a warning.
CI failed to find the polish data for some reason. The previous build passed and this change clearly couldn't cause problems so don't worry about this failure as far as this PR is concerned (I will see if I can work out why it isn't finding the data).
3402912
to
c0054b6
Compare
Apologies for abusing your PR branch to debug the CI issue but it needs quite particular circumstances which would be tricky for me to recreate alone.
Unfortunately I just don't see how to make it work in this case though. We need to know the repo URL and branch name for the PR branch. We can get the branch name, but the repo URL doesn't seem to be anywhere. We can use
I'd hoped it'd be in an environment variable or there would be a remote pointing at it, but there doesn't seem to be. The closest I can see is the email address on the (automatically generated) merge commit:
I'm not sure if this address is always just the github user name before the
I'll try to reset the branch to a sensible state.
This avoids producing an empty stem for input `cie`, avoids reducing `liście` to `ł` and avoids removing acute accents from inputs `ć`, `ń`, `ś` and `ź`. It is also a bit more efficient (0.7% faster to stem the sample vocabulary in C).
About 8% faster
~19% time reduction from branch point for stemming sample vocab list in C.
~26% time reduction.
Co-authored-by: Olly Betts <[email protected]>
a2eb25d
to
6429800
Compare
I agree, that is unfortunate. Maybe it's worth filing a bug against GitHub itself about it? For now, I rebased my branch and pushed it myself, so the tests should hopefully pass now.
This PR supersedes #159 and #220.
I rewrote the Polish stemmer almost from scratch. The main differences compared to previous approaches are:
- `naj-` superlative prefix. This allows me to keep the code cleaner.
- `or`, not using `do`.

I decided to make a new PR, first because this is a rewrite so most of the old comments are irrelevant, and second to make it easier to compare this approach with the previous ones.
This PR has only one FIXME comment. I tried to resolve it by making p0 hardcoded (`$p0 = 2`), but that behaves differently in UTF-8 and ISO-8859-2 encodings, so I did not go this way after all.
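A short Python sketch may help show why a fixed offset of 2 is encoding-dependent. It assumes the hard-coded value counts storage units (bytes) rather than characters, which is an interpretation of the remark above and not something verified against the generated code. `prefix_at_offset` is a hypothetical helper.

```python
def prefix_at_offset(word: str, encoding: str, offset: int = 2) -> str:
    """Decode the first `offset` bytes of `word` as stored in `encoding`."""
    raw = word.encode(encoding)
    # errors="replace" keeps the sketch safe if the cut lands inside a character.
    return raw[:offset].decode(encoding, errors="replace")
```

For a word starting with non-ASCII letters, such as żółw, two bytes cover a single letter in UTF-8 but two letters in ISO-8859-2, so the same number marks different positions in the word.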