Contributor

@mitya57 mitya57 commented Jun 15, 2025

This PR supersedes #159 and #220.

I rewrote the Polish stemmer almost from scratch. The main differences compared to previous approaches are:

  • I ignore derivational suffixes and focus on inflectional suffixes. Also I ignore the naj- superlative prefix. This allows me to keep the code cleaner.
  • The list of endings is more systematic and ordered. I tried to cover as many different forms and declension/conjugation patterns as possible.
  • The functions are grouped using or, not using do.
  • Added a new step to remove a trailing kreska, which indicates softness of a consonant (like Russian ь) and is missing in most oblique cases because they have other letters at the end.

I decided to make a new PR, first because this is a rewrite so most of old comments are irrelevant, and second to make it easier to compare this approach with the previous ones.

This PR has only one FIXME comment. I tried to resolve it by making p0 hardcoded ($p0 = 2), but that behaves differently in UTF-8 and ISO-8859-2 encodings, so I did not go this way after all.

Member

ojwb commented Jun 15, 2025

Thanks.

> This PR has only one FIXME comment. I tried to resolve it by making p0 hardcoded ($p0 = 2), but that behaves differently in UTF-8 and ISO-8859-2 encodings, so I did not go this way after all.

Also different for UTF-8 vs wide characters.

My immediate thought was Czech which has syllabic consonants, but according to https://en.wikipedia.org/wiki/Polish_language#Consonant_distribution : Unlike languages such as Czech, Polish does not have syllabic consonants – the nucleus of a syllable is always a vowel.

Maybe it's worth trying to think about whether there are rules about these consonant clusters we can use though.

If you want the cursor two Unicode characters in, then next 2 gives that. E.g. see mark_regions in german.sbl which enforces a minimum of 3 characters before p1.
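To illustrate the encoding point, here is a minimal Python sketch, assuming a hardcoded offset counts code units of the underlying string (bytes for UTF-8 and ISO-8859-2). The helper name `prefix_at_byte_offset` is hypothetical and is not part of Snowball.

```python
def prefix_at_byte_offset(word: str, offset: int, encoding: str) -> str:
    """Return the text covered by the first `offset` bytes of `word` in `encoding`."""
    raw = word.encode(encoding)
    # errors="replace" keeps the sketch from raising if the offset splits a character.
    return raw[:offset].decode(encoding, errors="replace")

# ż and ć take two bytes each in UTF-8 but one byte each in ISO-8859-2.
utf8_prefix = prefix_at_byte_offset("żyć", 2, "utf-8")
latin2_prefix = prefix_at_byte_offset("żyć", 2, "iso8859_2")
print(repr(utf8_prefix), repr(latin2_prefix))
```

With a byte offset of 2, the UTF-8 prefix covers one letter and the ISO-8859-2 prefix covers two, which is why counting characters (as `next 2` does) is the encoding-independent option.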

Are you going to be at debcamp/debconf next month?

Contributor Author

mitya57 commented Jun 15, 2025

> Maybe it's worth trying to think about whether there are rules about these consonant clusters we can use though.

I don't think there is a law here.

  • chcieć and brać used to have a yer sound, which was regularly dropped. An alternative form chocieć existed for some time, though.
  • But not every consonant cluster arose this way: for example, grać is a clipped version of igrać, and did not have a yer.
  • Verbs like mieć, być, pić, wyć, ryć, żyć do not have any clusters at all, just one consonant. There are not many such verbs, though (these are all that I managed to come up with). Edit: after further thinking I can also name myć and nyć.

> If you want the cursor two Unicode characters in, then next 2 gives that.

Thank you, I will give it a try. Number 2 was chosen quite arbitrarily, maybe I should also play with 1 and 3 and look at the results.

> Are you going to be at debcamp/debconf next month?

Unfortunately, no…

Member

ojwb commented Jun 15, 2025

> > Maybe it's worth trying to think about whether there are rules about these consonant clusters we can use though.
>
> I don't think there is a law here.

Fair enough. I noticed Wikipedia says "Polish can have word-initial and word-medial clusters of up to four consonants" but that's probably not helpful unless it's possible to have five or more initial consonants and we should break after four in those cases.

>   • Verbs like mieć, być, pić, wyć, ryć, żyć do not have any clusters at all, just one consonant. There are not many such verbs, though (these are all that I managed to come up with). Edit: after further thinking I can also name myć and nyć.

Producing single letter stems, even if linguistically correct, can be problematic - there just aren't very many single letters so unwanted conflation is a bigger risk - e.g. consider discussion of "the letter b", musical notes: "in the key of B minor", chemical elements: "B is the symbol for Boron", itemisation: "In case b, we see [...]", "blood type: B", "vitamin B", "Hepatitis B", people's initials: "B. Wilson", etc.

There's of course unwanted conflation in search results between all of these already, but stemming forms of być to b will likely mean searching for them produces a lot more noise and the benefits of conflating the different forms will tend to be significantly reduced if not wiped out entirely.

So it might be better to just not conflate forms of such words, or if possible map them to a longer stem. Grouping some forms together on to a longer stem (or onto several longer stems) could also be an option (probably terrible example just to illustrate what I mean: for być, conflating byłem, byłam, byłom, byłyśmy, etc on stem był).
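As a rough illustration of the "longer stem" idea, the Python sketch below strips an ending only when the remaining stem stays above a minimum length. The ending list and the `MIN_STEM_LEN` threshold are assumptions made for this example and are not the rules used in the PR.

```python
# Hypothetical past-tense endings that follow the stem "był"; not the PR's rule set.
PAST_ENDINGS = ("yśmy", "em", "am", "om")
MIN_STEM_LEN = 3  # assumed threshold, chosen only for this illustration

def stem_past(word: str) -> str:
    """Strip one listed ending, but only if the remaining stem is long enough."""
    for ending in sorted(PAST_ENDINGS, key=len, reverse=True):
        stem = word[: -len(ending)]
        if word.endswith(ending) and len(stem) >= MIN_STEM_LEN:
            return stem
    return word  # leave short or unmatched words alone

stems = {w: stem_past(w) for w in ("byłem", "byłam", "byłom", "byłyśmy", "być")}
print(stems)
```

In this toy version the four past-tense forms conflate on `był`, while `być` itself is left untouched rather than being reduced to a single letter.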

@mitya57 mitya57 force-pushed the polish-stemmer-3 branch from fb0887a to 6c2d03a Compare June 16, 2025 15:53
Contributor Author

mitya57 commented Jun 16, 2025

> So it might be better to just not conflate forms of such words, or if possible map them to a longer stem. Grouping some forms together on to a longer stem (or onto several longer stems) could also be an option (probably terrible example just to illustrate what I mean: for być, conflating byłem, byłam, byłom, byłyśmy, etc on stem był).

I implemented this now, and removed my FIXME comment.

// so we need to process it after we are done with adjectival forms.
setlimit tomark p0 for ([substring]) among (
'sz{ek}' // present 1st person singular (noszę)
'sz{ak}' // present 3rd person plural (noszą)
Member

As far as I can see we can never match 'sz{ak}' here because it's also handled in adjectival (with action delete instead of <- 's'). Testing the example word here, it gets stemmed to no not nos.

Member

@ojwb ojwb Jun 17, 2025

From a quick test, it looks like always doing <- 's' for 'sz{ak}' is worse than always doing delete.

Based on the example words in the comments (R1 starts at the | in nos|zą and lep|szą), I tried leaving the s in place for shorter words:

      'sz{ak}'  
        (R1 and delete or <-'s')

That fixes noszą to stem to nos as the comment suggests it should, but I couldn't tell if that was better overall (it's at least not obviously worse than always removing it) - might be worth you taking a look at.
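The effect of that rule can be modelled outside Snowball. The Python sketch below assumes the usual R1 definition (the region after the first non-vowel following a vowel) and an assumed Polish vowel set; it handles only the ending `szą` and ignores the stemmer's other steps and limits, so it is an approximation rather than the PR's code.

```python
VOWELS = set("aąeęioóuy")  # assumed Polish vowel set for this sketch

def r1_start(word: str) -> int:
    """Index where R1 begins: just after the first non-vowel that follows a vowel."""
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)  # no such position: R1 is empty

def strip_sza(word: str) -> str:
    """Model of `(R1 and delete or <-'s')` for the ending 'szą' only."""
    suffix = "szą"
    if not word.endswith(suffix):
        return word
    cut = len(word) - len(suffix)
    if cut >= r1_start(word):   # whole suffix lies inside R1: delete it
        return word[:cut]
    return word[:cut] + "s"     # otherwise keep the s

for w in ("noszą", "lepszą", "cieszą"):
    print(w, "->", strip_sza(w))
```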

Contributor Author

> I tried leaving the s in place for shorter words:

Both adjectives and verbs can be short and long. But I decided to check your heuristic that short words are most probably verbs. I took words matching ^[a-z]{,3}szą$ from the sample vocabulary:

  • cieszą — verb
  • dalszą — adjective
  • duszą — noun
  • gorszą — adjective
  • gruszą — noun
  • kuszą — noun
  • lepszą — adjective
  • mszą — noun
  • muszą — verb
  • naszą — pronoun (declines as an adjective)
  • noszą — verb
  • nowszą — adjective
  • pieszą — adjective
  • piszą — verb
  • proszą — verb
  • ruszą — verb
  • skuszą — verb
  • słyszą — verb
  • suszą — noun
  • unoszą — verb
  • waszą — pronoun (declines as an adjective)
  • wnoszą — verb
  • zmuszą — verb
  • znoszą — verb

(Nouns in this list have sz as part of their root, not as a suffix.)
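A similar filter can be sketched in Python. Note that `[a-z]` in Python's `re` module is ASCII-only, so this version lists the Polish letters explicitly in order to keep words such as słyszą; the sample word list is made up for the example and is not the sample vocabulary.

```python
import re

# [a-z] in Python's re is ASCII-only, so Polish letters are listed explicitly here.
SHORT_SZA = re.compile(r"^[a-ząćęłńóśźż]{,3}szą$")

# Hypothetical sample; the real input would be the stemmer's vocabulary file.
sample = ["noszą", "słyszą", "mszą", "lepszą", "najbliższą", "nosząc"]
short_forms = [w for w in sample if SHORT_SZA.match(w)]
print(short_forms)
```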

So, indeed most of the short forms in -szą will be verbs. But with any such heuristic there will be many false positives.

What I can say for sure is that forms starting with naj- and having -sz- + adjectival ending at the end are definitely superlative forms of adjectives and should not be treated as verbs. But my level of snowball knowledge does not allow me to express this rule (I made some attempts earlier but failed).
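Stated outside Snowball, the rule might look like the Python sketch below. The list of adjectival endings is an assumption taken from forms such as najbliższa, najbliższego and najbliższymi, and is not claimed to be complete; the function name is hypothetical.

```python
import re

# naj- prefix, some stem, then -sz- followed by an adjectival ending.
# The ending list is illustrative only.
SUPERLATIVE = re.compile(r"^naj.+sz(?:ymi|ego|ych|ej|ym|a|ą|e|y)$")

def looks_superlative(word: str) -> bool:
    """True if the word has the naj- ... -sz- + adjectival ending shape."""
    return SUPERLATIVE.match(word) is not None

for w in ("najbliższą", "najpiękniejszych", "noszą", "najbliżej"):
    print(w, looks_superlative(w))
```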

Contributor Author

And a short review of changes after applying the suggested change (R1 and delete or <-'s') on top of your branch:

| source | old | new | part of speech | comment |
|--------|-----|-----|----------------|---------|
| cieszą | cie | cies | verb | looks good |
| duszą | du | dus | noun | ideally should be dusz, but better than before |
| gruszą | gru | grus | noun | ideally should be grusz, but better than before |
| kuszą | ku | kus | noun | ideally should be kusz, but better than before |
| muszą | mu | mus | verb | looks good |
| naszą | na | nas | pronoun | ideally should be nasz, but better than before |
| noszą | no | nos | verb | looks good |
| pieszą | pie | pies | adjective | looks good (not a comparative form, sz is part of the root) |
| piszą | pi | pis | verb | looks good |
| proszą | pro | pros | verb | looks good |
| ruszą | ru | rus | verb | ideally should be rusz, but better than before |
| skuszą | sku | skus | verb | looks good |
| słyszą | sły | słys | verb | ideally should be słysz, but better than before |
| suszą | su | sus | noun or verb | ideally should be susz (in both cases), but better than before |
| waszą | wa | was | pronoun | ideally should be wasz, but better than before |
| wnoszą | wno | wnos | verb | looks good |
| zmuszą | zmu | zmus | verb | looks good |
| znoszą | zno | znos | verb | looks good |

I don’t see any downsides of this change.

Member

> What I can say for sure is that forms starting with naj- and having -sz- + adjectival ending on the end are definitely superlative forms of adjectives and should not be treated as verbs. But my level of snowball knowledge does not allow me to express this rule

I think you're wanting reverse which swaps the cursor and the limit and flips the direction. Untested, but maybe something like:

'sz{ak}'  
        ((R1 or reverse 'naj') and delete or <-'s')

Not 100% sure I've inserted that in the right place, but hopefully it shows how you'd use it.

Member

Oh, except that we set the backwards limit to be two characters from the start of the word at the top level, so with reverse we will test for naj there not at the start of the word.

I'll have a think about how best to do this.

Member

I was taking a look at this but there don't seem to be any words in polish/voc.txt which start naj and reach this rule.

And thinking about it, the R1 definition means that for any word starting naj, R1 will start after the j so testing for naj here is redundant. Were you actually suggesting testing for naj somewhere else?

Marking R1 for words from your ^[a-z]{,3}szą$ list:

  • cies|zą — verb
  • dal|szą — adjective
  • dus|zą — noun
  • gor|szą — adjective
  • grus|zą — noun
  • kus|zą — noun
  • lep|szą — adjective
  • mszą| — noun
  • mus|zą — verb
  • nas|zą — pronoun (declines as an adjective)
  • nos|zą — verb
  • now|szą — adjective
  • pies|zą — adjective
  • pis|zą — verb
  • pros|zą — verb
  • rus|zą — verb
  • skus|zą — verb
  • słys|zą — verb
  • sus|zą — noun
  • un|oszą — verb
  • was|zą — pronoun (declines as an adjective)
  • wnos|zą — verb
  • zmus|zą — verb
  • znos|zą — verb

Ideally we delete -szą for an adjective and replace with -s for a verb.

I notice that most of the adjectives have -szą in R1 because they start consonant vowel consonant - e.g. dal|szą. The only exception seems to be pies|zą, so in general the current R1 test is very good there.

It's similarly good for verbs - only un|oszą seems to be mishandled.

If was|zą declines as an adjective then it's handled wrongly, but handling of pronouns is unlikely to be important unless they get conflated with other unrelated words.

IIUC nouns should be handled with replace with -s like verbs and all the nouns here are.
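The analysis above can be reproduced with a short Python sketch. It assumes the usual R1 definition and an assumed Polish vowel set, takes the part-of-speech labels from the list above (pronouns left out), and treats "delete for adjectives, replace with -s otherwise" as the desired outcome. It is a model of the reasoning, not of the stemmer itself.

```python
VOWELS = set("aąeęioóuy")  # assumed Polish vowel set for this sketch

def r1_start(word: str) -> int:
    """Index where R1 begins: just after the first non-vowel that follows a vowel."""
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)

def mark_r1(word: str) -> str:
    """Show the start of R1 with a '|', e.g. 'nos|zą'."""
    p = r1_start(word)
    return word[:p] + "|" + word[p:]

def action(word: str) -> str:
    """'delete' if the 3-letter ending 'szą' lies inside R1, else 'replace'."""
    return "delete" if len(word) - 3 >= r1_start(word) else "replace"

LABELS = {
    "cieszą": "verb", "dalszą": "adjective", "duszą": "noun", "gorszą": "adjective",
    "gruszą": "noun", "kuszą": "noun", "lepszą": "adjective", "mszą": "noun",
    "muszą": "verb", "noszą": "verb", "nowszą": "adjective", "pieszą": "adjective",
    "piszą": "verb", "proszą": "verb", "ruszą": "verb", "skuszą": "verb",
    "słyszą": "verb", "suszą": "noun", "unoszą": "verb", "wnoszą": "verb",
    "zmuszą": "verb", "znoszą": "verb",
}

mismatches = [
    w for w, pos in LABELS.items()
    if action(w) != ("delete" if pos == "adjective" else "replace")
]
print([mark_r1(w) for w in mismatches])
```

Under these assumptions only pieszą and unoszą come out on the "wrong" side of the R1 test.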

Contributor Author

> I was taking a look at this but there don't seem to be any words in polish/voc.txt which start naj and reach this rule.

There are some words in the vocabulary that match the ^naj.*sz.* regex.
I looked at the output and most of them are stemmed correctly:

polish/voc.txt:najpiękniej
polish/voc.txt:najpiękniejsza
polish/voc.txt:najpiękniejszą
polish/voc.txt:najpiękniejsze
polish/voc.txt:najpiękniejszego
polish/voc.txt:najpiękniejszej
polish/voc.txt:najpiękniejszy
polish/voc.txt:najpiękniejszych
polish/voc.txt:najpiękniejszym
polish/output.txt:najpiękn
polish/output.txt:najpiękn
polish/output.txt:najpiękn
polish/output.txt:najpiękn
polish/output.txt:najpiękn
polish/output.txt:najpiękn
polish/output.txt:najpiękn
polish/output.txt:najpiękn
polish/output.txt:najpiękn

or

polish/voc.txt:najbliżej
polish/voc.txt:najbliżsi
polish/voc.txt:najbliższa
polish/voc.txt:najbliższą
polish/voc.txt:najbliższe
polish/voc.txt:najbliższego
polish/voc.txt:najbliższej
polish/voc.txt:najbliższy
polish/voc.txt:najbliższych
polish/voc.txt:najbliższym
polish/voc.txt:najbliższymi
polish/output.txt:najbliż
polish/output.txt:najbliżs
polish/output.txt:najbliż
polish/output.txt:najbliż
polish/output.txt:najbliż
polish/output.txt:najbliż
polish/output.txt:najbliż
polish/output.txt:najbliż
polish/output.txt:najbliż
polish/output.txt:najbliż
polish/output.txt:najbliż

One thing that catches the eye is that we don't handle plural nominative forms where -szy turns into -si, but that is just one form and I think we can ignore it (it will be hard to fix it properly without false positives).

But again, I think we are doing very well here and no special handling of naj- is needed. Were there any other issues holding up this PR?

Member

If you're happy with the naj- situation, I think the only thing is the website update - we have snowballstem/snowball-website#29 but that's for tomek-ai's version.

I feel that transcribing the algorithm from snowball to English prose is not all that useful. Perhaps it can help find unintended mistakes in the Snowball code, but mostly it gives us two different descriptions of the algorithm, which can then disagree.

What has proved useful is notes about design decisions such as what you chose to do and why (and what you chose not to do and why not).

Contributor Author

Okay, I will try to do it, probably next week.

Member

ojwb commented Jun 17, 2025

I wondered if we could actually handle most of the ending removal in one big among, which should be faster - currently among is O(log(#cases)) (so sublinear) but I'm working on an O(1) (or really O(suffix_length_matched)) implementation for C (extremely sublinear!)
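To give a feel for a lookup whose cost depends on the matched suffix length rather than on the number of cases, here is a Python sketch using a trie keyed on reversed suffixes. It is only an illustration of the general idea; it is not the code Snowball generates, and the function names and suffix list are hypothetical.

```python
END = None  # sentinel key marking "a suffix ends here"; never equal to a character

def build_suffix_trie(suffixes):
    """Build a trie keyed on reversed suffixes."""
    root = {}
    for suffix in suffixes:
        node = root
        for ch in reversed(suffix):
            node = node.setdefault(ch, {})
        node[END] = suffix
    return root

def longest_suffix(word, trie):
    """Walk the word backwards; work done is bounded by the longest suffix, not the suffix count."""
    node, best = trie, None
    for ch in reversed(word):
        node = node.get(ch)
        if node is None:
            break
        best = node.get(END, best)
    return best

trie = build_suffix_trie(["ą", "szą", "ego", "ymi"])
print(longest_suffix("noszą", trie), longest_suffix("dobrego", trie))
```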

This refactor reduced the time taken to stem the vocab list by ~26% for C compared to the version here, while producing the same stems as my initial tweaked version:

ojwb@78acb18

To merge the noun forms into the same among it seems we have to use an among function to achieve only considering them in R1 (hence the R1 added after each of those). That's faster than a separate among with a different setlimit though.

Doing this also helps reveal cases which can't match because they're dominated by cases in earlier checks (I've noted the two I identified in separate comments above).

Contributor Author

mitya57 commented Jun 30, 2025

Your modifications look good to me, thank you! I merged them into my branch, and made two commits on top of them.

Member

ojwb commented Jul 10, 2025

I've pushed a trivial fix for a warning.

Member

ojwb commented Jul 10, 2025

> I've pushed a trivial fix for a warning.

CI failed to find the polish data for some reason. The previous build passed and this change clearly couldn't cause problems so don't worry about this failure as far as this PR is concerned (I will see if I can work out why it isn't finding the data).

@ojwb ojwb force-pushed the polish-stemmer-3 branch 3 times, most recently from 3402912 to c0054b6 Compare July 10, 2025 14:31
Member

ojwb commented Jul 10, 2025

Apologies for abusing your PR branch to debug the CI issue but it needs quite particular circumstances which would be tricky for me to recreate alone.

Unfortunately I just don't see how to make it work in this case though.

We need to know the repo URL and branch name for the PR branch. We can get the branch name, but the repo URL doesn't seem to be anywhere. We can use $GITHUB_ACTOR to get a username which is the right one when someone opens a PR or that same person pushes to it. However if a maintainer pushes to someone else's PR branch then the maintainer's user name is in $GITHUB_ACTOR instead.

I'd hoped it'd be in an environment variable or there would be a remote pointing at it, but there doesn't seem to be.

The closest I can see is the email address on the (automatically generated) merge commit:

$ git -C "$GITHUB_WORKSPACE" show "pull/$GITHUB_REF_NAME"
commit 4830554d3c2044210cb9643438212f8bec54f0fd
Author: Dmitry Shachnev <[email protected]>
Date:   Thu Jul 10 14:31:56 2025 +0000

    Merge c0054b62971b855b4c0590f308e218feec0ccd22 into 9fb4c01fdc355e9deeed0ff2b4671d3faf3eb239

I'm not sure whether this address is always just the GitHub user name before the @, or whether it might be a different email address for the user (if it could be, it isn't useful to us).

I'll try to reset the branch to a sensible state.

@ojwb ojwb force-pushed the polish-stemmer-3 branch from c0054b6 to a2eb25d Compare July 10, 2025 16:56
mitya57 and others added 8 commits July 12, 2025 18:07
This avoids producing an empty stem for input `cie`, avoids
reducing `liście` to `ł` and avoids removing acute accents
from inputs `ć`, `ń`, `ś` and `ź`.

It is also a bit more efficient (0.7% faster to stem the
sample vocabulary in C).
~19% time reduction from branch point for stemming sample vocab
list in C.
~26% time reduction.
@mitya57 mitya57 force-pushed the polish-stemmer-3 branch from a2eb25d to 6429800 Compare July 12, 2025 15:07
Contributor Author

mitya57 commented Jul 12, 2025

> I'd hoped it'd be in an environment variable or there would be a remote pointing at it, but there doesn't seem to be.

I agree, that is unfortunate. Maybe it's worth filing a bug against GitHub itself about it?

For now, I rebased my branch and pushed myself, so the tests should hopefully pass now.
