Polish Stemmer v3 #245

mitya57 · 2025-06-15T19:23:13Z

This PR supersedes #159 and #220.

I rewrote the Polish stemmer almost from scratch. The main differences compared to previous approaches are:

- I ignore derivational suffixes and focus on inflectional suffixes. I also ignore the naj- superlative prefix. This keeps the code cleaner.
- The list of endings is more systematic and ordered. I tried to cover as many different forms and declension/conjugation patterns as possible.
- The functions are grouped using or, not using do.
- I added a new step to remove the trailing kreska, which indicates softness of a consonant (like Russian ь) and is missing in most oblique cases because they have other letters at the end.

I decided to make a new PR, first because this is a rewrite, so most of the old comments are irrelevant, and second to make it easier to compare this approach with the previous ones.

This PR has only one FIXME comment. I tried to resolve it by making p0 hardcoded ($p0 = 2), but that behaves differently in UTF-8 and ISO-8859-2 encodings, so I did not go this way after all.

ojwb · 2025-06-15T21:06:31Z

Thanks.

> This PR has only one FIXME comment. I tried to resolve it by making p0 hardcoded ($p0 = 2), but that behaves differently in UTF-8 and ISO-8859-2 encodings, so I did not go this way after all.

Also different for UTF-8 vs wide characters.

My immediate thought was Czech which has syllabic consonants, but according to https://en.wikipedia.org/wiki/Polish_language#Consonant_distribution : Unlike languages such as Czech, Polish does not have syllabic consonants – the nucleus of a syllable is always a vowel.

Maybe it's worth trying to think about whether there are rules about these consonant clusters we can use though.

If you want the cursor two Unicode characters in, then hop 2 gives that. E.g. see mark_regions in german.sbl, which enforces a minimum of 3 characters before p1.
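To make the encoding point concrete, here is a small Python sketch (not part of the PR) showing that a fixed offset of 2 code units lands in different places depending on how the word is stored, while counting in Unicode characters does not. The helper name `prefix_at_offset` is made up for this illustration.

```python
def prefix_at_offset(word: str, offset: int, encoding: str) -> str:
    """Return the text before `offset` code units when `word` is stored in `encoding`."""
    raw = word.encode(encoding)
    # A cut in the middle of a multi-byte character would decode to U+FFFD.
    return raw[:offset].decode(encoding, errors="replace")


word = "\u017cy\u0107"  # żyć

utf8_prefix = prefix_at_offset(word, 2, "utf-8")         # ż alone fills both bytes
latin2_prefix = prefix_at_offset(word, 2, "iso-8859-2")  # one byte per letter
char_prefix = word[:2]                                   # counted in Unicode characters

print(utf8_prefix, latin2_prefix, char_prefix)
```

With a byte-oriented offset the UTF-8 prefix is one letter shorter than the ISO-8859-2 one, which is the kind of divergence a hardcoded `$p0 = 2` would introduce.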

Are you going to be at debcamp/debconf next month?

mitya57 · 2025-06-15T21:52:15Z

> Maybe it's worth trying to think about whether there are rules about these consonant clusters we can use though.

I don't think there is a law here.

chcieć and brać used to have a yer sound, which was regularly dropped. An alternative form chocieć existed for some time, though.
But not every consonant cluster arose this way: for example, grać is a clipped version of igrać, and did not have a yer.
Verbs like mieć, być, pić, wyć, ryć, żyć do not have any clusters at all, just one consonant. There are not many such verbs, though (these are all that I managed to come up with). Edit: after further thinking I can also name myć and nyć.

> If you want the cursor two Unicode characters in, then hop 2 gives that.

Thank you, I will give it a try. The number 2 was chosen quite arbitrarily; maybe I should also play with 1 and 3 and look at the results.

> Are you going to be at debcamp/debconf next month?

Unfortunately, no…

ojwb · 2025-06-15T23:04:43Z

> > Maybe it's worth trying to think about whether there are rules about these consonant clusters we can use though.
>
> I don't think there is a law here.

Fair enough. I noticed Wikipedia says "Polish can have word-initial and word-medial clusters of up to four consonants" but that's probably not helpful unless it's possible to have five or more initial consonants and we should break after four in those cases.

> Verbs like mieć, być, pić, wyć, ryć, żyć do not have any clusters at all, just one consonant. There are not many such verbs, though (these are all that I managed to come up with). Edit: after further thinking I can also name myć and nyć.

Producing single letter stems, even if linguistically correct, can be problematic - there just aren't very many single letters so unwanted conflation is a bigger risk - e.g. consider discussion of "the letter b", musical notes: "in the key of B minor", chemical elements: "B is the symbol for Boron", itemisation: "In case b, we see [...]", "blood type: B", "vitamin B", "Hepatitis B", people's initials: "B. Wilson", etc.

There's of course unwanted conflation in search results between all of these already, but stemming forms of być to b will likely mean searching for them produces a lot more noise and the benefits of conflating the different forms will tend to be significantly reduced if not wiped out entirely.

So it might be better to just not conflate forms of such words, or if possible map them to a longer stem. Grouping some forms together on to a longer stem (or onto several longer stems) could also be an option (probably terrible example just to illustrate what I mean: for być, conflating byłem, byłam, byłom, byłyśmy, etc on stem był).
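A toy Python sketch of the "do not go below a minimum stem length" idea follows. The ending list and the limit of 2 are assumptions chosen for illustration and are not the endings used in polish.sbl.

```python
MIN_STEM_CHARS = 2  # assumed limit for this sketch

# Hypothetical endings, just enough to exercise a few forms of "być".
TOY_ENDINGS = ("yć", "ć", "yłem", "em", "yłam", "am")


def strip_ending(word, endings, min_stem=MIN_STEM_CHARS):
    """Remove the longest ending that still leaves at least `min_stem` characters."""
    for ending in sorted(endings, key=len, reverse=True):
        if word.endswith(ending) and len(word) - len(ending) >= min_stem:
            return word[: -len(ending)]
    return word


for form in ("być", "byłem", "byłam"):
    # First column: no guard at all; second column: guarded stem.
    print(form, strip_ending(form, TOY_ENDINGS, min_stem=0), strip_ending(form, TOY_ENDINGS))
```

Without the guard every form collapses to the single letter b; with it, byłem and byłam still conflate (on był) without colliding with unrelated uses of "b".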

mitya57 · 2025-06-16T15:55:15Z

> So it might be better to just not conflate forms of such words, or if possible map them to a longer stem. Grouping some forms together on to a longer stem (or onto several longer stems) could also be an option (probably terrible example just to illustrate what I mean: for być, conflating byłem, byłam, byłom, byłyśmy, etc on stem był).

I implemented this now, and removed my FIXME comment.

ojwb · 2025-06-17T00:09:29Z

algorithms/polish.sbl

+    // so we need to process it after we are done with adjectival forms.
+    setlimit tomark p0 for ([substring]) among (
+      'sz{ek}' // present 1st person singular (noszę)
+      'sz{ak}' // present 3rd person plural (noszą)


As far as I can see we can never match 'sz{ak}' here because it's also handled in adjectival (with action delete instead of <- 's'). Testing the example word here, it gets stemmed to no, not nos.

From a quick test, it looks like always doing <- 's' for 'sz{ak}' is worse than always doing delete.

Based on the example words in the comments (R1 starts at the | in nos|zą and lep|szą), I tried leaving the s in place for shorter words:

'sz{ak}' (R1 and delete or <-'s')

That fixes noszą to stem to nos as the comment suggests it should, but I couldn't tell if that was better overall (it's at least not obviously worse than always removing it) - might be worth you taking a look at.
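The effect of this one case can be sketched in Python. This is a simplified model and not the Snowball code: it assumes the standard R1 definition (the region after the first non-vowel that follows a vowel), an assumed Polish vowel set, and it applies only the -szą case, ignoring every other ending and the top-level length limit.

```python
VOWELS = set("aąeęioóuy")  # assumed vowel set; the real grouping lives in polish.sbl


def r1_start(word):
    """Index just past the first non-vowel that follows a vowel (len(word) if none)."""
    for i in range(1, len(word)):
        if word[i - 1] in VOWELS and word[i] not in VOWELS:
            return i + 1
    return len(word)


def stem_sza(word):
    """Apply only the 'szą' case: delete it inside R1, otherwise replace it by 's'."""
    suffix = "szą"
    if not word.endswith(suffix):
        return word
    start = len(word) - len(suffix)
    if start >= r1_start(word):
        return word[:start]        # models: R1 and delete
    return word[:start] + "s"      # models: <- 's'


for w in ("noszą", "lepszą", "pieszą", "unoszą"):
    print(w, stem_sza(w))
```

In this model noszą keeps its s (nos) while lepszą loses the whole suffix (lep), matching the R1 positions given in the comments.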

> I tried leaving the s in place for shorter words:

Both adjectives and verbs can be short and long. But I decided to check your heuristic that short words are most probably verbs. I took words matching ^[a-z]{,3}szą$ from the sample vocabulary:

- cieszą — verb
- dalszą — adjective
- duszą — noun
- gorszą — adjective
- gruszą — noun
- kuszą — noun
- lepszą — adjective
- mszą — noun
- muszą — verb
- naszą — pronoun (declines as an adjective)
- noszą — verb
- nowszą — adjective
- pieszą — adjective
- piszą — verb
- proszą — verb
- ruszą — verb
- skuszą — verb
- słyszą — verb
- suszą — noun
- unoszą — verb
- waszą — pronoun (declines as an adjective)
- wnoszą — verb
- zmuszą — verb
- znoszą — verb

(Nouns in this list have sz as part of their root, not as a suffix.)

So, indeed, most of the short forms in -szą will be verbs. But with any such heuristic there will be many false positives.

What I can say for sure is that forms starting with naj- and having -sz- + adjectival ending on the end are definitely superlative forms of adjectives and should not be treated as verbs. But my level of Snowball knowledge does not allow me to express this rule (I made some attempts earlier but failed).

And a short review of changes after applying the suggested change (R1 and delete or <-'s') on top of your branch:

| source | old | new | part of speech | comment |
|---|---|---|---|---|
| cieszą | cie | cies | verb | looks good |
| duszą | du | dus | noun | ideally should be dusz, but better than before |
| gruszą | gru | grus | noun | ideally should be grusz, but better than before |
| kuszą | ku | kus | noun | ideally should be kusz, but better than before |
| muszą | mu | mus | verb | looks good |
| naszą | na | nas | pronoun | ideally should be nasz, but better than before |
| noszą | no | nos | verb | looks good |
| pieszą | pie | pies | adjective | looks good (not a comparative form, sz is part of the root) |
| piszą | pi | pis | verb | looks good |
| proszą | pro | pros | verb | looks good |
| ruszą | ru | rus | verb | ideally should be rusz, but better than before |
| skuszą | sku | skus | verb | looks good |
| słyszą | sły | słys | verb | ideally should be słysz, but better than before |
| suszą | su | sus | noun or verb | ideally should be susz (in both cases), but better than before |
| waszą | wa | was | pronoun | ideally should be wasz, but better than before |
| wnoszą | wno | wnos | verb | looks good |
| zmuszą | zmu | zmus | verb | looks good |
| znoszą | zno | znos | verb | looks good |

I don’t see any downsides to this change.

> What I can say for sure is that forms starting with naj- and having -sz- + adjectival ending on the end are definitely superlative forms of adjectives and should not be treated as verbs. But my level of snowball knowledge does not allow me to express this rule

I think you're wanting reverse which swaps the cursor and the limit and flips the direction. Untested, but maybe something like:

'sz{ak}' ((R1 or reverse 'naj') and delete or <-'s')

Not 100% sure I've inserted that in the right place, but hopefully it shows how you'd use it.

Oh, except that we set the backwards limit to be two characters from the start of the word at the top level, so with reverse we will test for naj there not at the start of the word.

I'll have a think about how best to do this.

I was taking a look at this but there don't seem to be any words in polish/voc.txt which start naj and reach this rule.

And thinking about it, the R1 definition means that for any word starting naj, R1 will start after the j so testing for naj here is redundant. Were you actually suggesting testing for naj somewhere else?

Marking R1 for words from your ^[a-z]{,3}szą$ list:

- cies|zą — verb
- dal|szą — adjective
- dus|zą — noun
- gor|szą — adjective
- grus|zą — noun
- kus|zą — noun
- lep|szą — adjective
- mszą| — noun
- mus|zą — verb
- nas|zą — pronoun (declines as an adjective)
- nos|zą — verb
- now|szą — adjective
- pies|zą — adjective
- pis|zą — verb
- pros|zą — verb
- rus|zą — verb
- skus|zą — verb
- słys|zą — verb
- sus|zą — noun
- un|oszą — verb
- was|zą — pronoun (declines as an adjective)
- wnos|zą — verb
- zmus|zą — verb
- znos|zą — verb

Ideally we delete -szą for an adjective and replace with -s for a verb.

I notice that most of the adjectives have -szą in R1 because they start consonant vowel consonant - e.g. dal|szą. The only exception seems to be pies|zą, so in general the current R1 test is very good there.

It's similarly good for verbs - only un|oszą seems to be mishandled.

If was|zą declines as an adjective then it's handled wrongly, but handling of pronouns is unlikely to be important unless they get conflated with other unrelated words.

IIUC nouns should be handled with replace with -s like verbs and all the nouns here are.
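The markings and the two exceptions noted above can be reproduced with a short Python check. It assumes the standard R1 definition and an assumed vowel set, so treat it as a sanity check of the reasoning and not as the stemmer's own logic; `mark_r1` and `sza_in_r1` are names invented for this sketch.

```python
def mark_r1(word, vowels="aąeęioóuy"):
    """Insert '|' where R1 begins: after the first non-vowel that follows a vowel."""
    for i in range(1, len(word)):
        if word[i - 1] in vowels and word[i] not in vowels:
            return word[: i + 1] + "|" + word[i + 1:]
    return word + "|"


def sza_in_r1(word):
    """True if the final three letters (the -szą suffix) lie entirely inside R1."""
    return mark_r1(word).index("|") <= len(word) - 3


ADJECTIVES = ["dalszą", "gorszą", "lepszą", "nowszą", "pieszą"]
VERBS = ["cieszą", "muszą", "noszą", "piszą", "proszą", "ruszą",
         "skuszą", "słyszą", "unoszą", "wnoszą", "zmuszą", "znoszą"]

# Adjectives should have -szą in R1 (deleted); verbs should not (replaced by s).
adjective_exceptions = [w for w in ADJECTIVES if not sza_in_r1(w)]
verb_exceptions = [w for w in VERBS if sza_in_r1(w)]
print(adjective_exceptions, verb_exceptions)
```

On the words classified in the thread, the only mismatches this check finds are pieszą among the adjectives and unoszą among the verbs.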

> I was taking a look at this but there don't seem to be any words in polish/voc.txt which start naj and reach this rule.

There are some words in the vocabulary that match the ^naj.*sz.* regex.
I looked at the output and most of them are stemmed correctly:

| polish/voc.txt | polish/output.txt |
|---|---|
| najpiękniej | najpiękn |
| najpiękniejsza | najpiękn |
| najpiękniejszą | najpiękn |
| najpiękniejsze | najpiękn |
| najpiękniejszego | najpiękn |
| najpiękniejszej | najpiękn |
| najpiękniejszy | najpiękn |
| najpiękniejszych | najpiękn |
| najpiękniejszym | najpiękn |

or

| polish/voc.txt | polish/output.txt |
|---|---|
| najbliżej | najbliż |
| najbliżsi | najbliżs |
| najbliższa | najbliż |
| najbliższą | najbliż |
| najbliższe | najbliż |
| najbliższego | najbliż |
| najbliższej | najbliż |
| najbliższy | najbliż |
| najbliższych | najbliż |
| najbliższym | najbliż |
| najbliższymi | najbliż |

One thing that catches the eye is that we don't handle plural nominative forms where -szy turns into -si, but that is just one form and I think we can ignore it (it will be hard to fix it properly without false positives).

But again, I think we are doing very well here and no special handling of naj- is needed. Were there any other issues holding up this PR?

If you're happy with the naj- situation, I think the only thing is the website update - we have snowballstem/snowball-website#29 but that's for tomek-ai's version.

I feel that transcribing the algorithm from Snowball to English prose is not all that useful. Perhaps it can help find unintended mistakes in the Snowball code, but mostly it gives us two different descriptions of the algorithm, which can then disagree.

What has proved useful is notes about design decisions such as what you chose to do and why (and what you chose not to do and why not).

Okay, I will try to do it, probably next week.

algorithms/polish.sbl

ojwb · 2025-06-17T00:55:32Z

I wondered if we could actually handle most of the ending removal in one big among, which should be faster - currently among is O(log(#cases)) (so sublinear) but I'm working on an O(1) (or really O(suffix_length_matched)) implementation for C (extremely sublinear!)

This refactor reduced the time taken to stem the vocab list by ~26% for C compared to the version here, while producing the same stems as my initial tweaked version:

ojwb@78acb18

To merge the noun forms into the same among it seems we have to use an among function to achieve only considering them in R1 (hence the R1 added after each of those). That's faster than a separate among with a different setlimit though.

Doing this also helps reveal cases which can't match because they're dominated by cases in earlier checks (I've noted the two I identified in separate comments above).
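As a rough illustration of why matching cost can depend on the length of the matched suffix instead of the number of cases, here is a Python sketch of longest-ending lookup using a trie keyed on reversed endings. It is not how the Snowball compiler generates code; the function names and the four toy endings are assumptions made for this example.

```python
def build_suffix_trie(endings):
    """Build a trie keyed on the endings read right to left."""
    root = {}
    for ending in endings:
        node = root
        for ch in reversed(ending):
            node = node.setdefault(ch, {})
        node[""] = ending  # terminal marker: a complete ending stops here
    return root


def longest_ending(word, trie):
    """Walk the word backwards; the work done is bounded by the suffix matched."""
    node, best = trie, None
    for ch in reversed(word):
        node = node.get(ch)
        if node is None:
            break
        best = node.get("", best)
    return best


TOY_ENDINGS = ("ą", "szą", "ego", "szego")  # hypothetical, not the real list
trie = build_suffix_trie(TOY_ENDINGS)

for w in ("lepszą", "nową", "lepszego", "kot"):
    print(w, longest_ending(w, trie))
```

Putting all endings into one structure like this also makes shadowing visible: an ending that can never be selected shows up when the table is built, and is no longer hidden behind an earlier, separate check.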

mitya57 · 2025-06-30T20:49:01Z

Your modifications look good to me, thank you! I merged them into my branch, and made two commits on top of them.

ojwb · 2025-07-10T08:37:23Z

I've pushed a trivial fix for a warning.

ojwb · 2025-07-10T09:38:00Z

> I've pushed a trivial fix for a warning.

CI failed to find the polish data for some reason. The previous build passed and this change clearly couldn't cause problems so don't worry about this failure as far as this PR is concerned (I will see if I can work out why it isn't finding the data).

ojwb · 2025-07-10T16:55:46Z

Apologies for abusing your PR branch to debug the CI issue but it needs quite particular circumstances which would be tricky for me to recreate alone.

Unfortunately I just don't see how to make it work in this case though.

We need to know the repo URL and branch name for the PR branch. We can get the branch name, but the repo URL doesn't seem to be anywhere. We can use $GITHUB_ACTOR to get a username which is the right one when someone opens a PR or that same person pushes to it. However if a maintainer pushes to someone else's PR branch then the maintainer's user name is in $GITHUB_ACTOR instead.

I'd hoped it'd be in an environment variable or there would be a remote pointing at it, but there doesn't seem to be.

The closest I can see is the email address on the (automatically generated) merge commit:

$ git -C "$GITHUB_WORKSPACE" show "pull/$GITHUB_REF_NAME"
commit 4830554d3c2044210cb9643438212f8bec54f0fd
Author: Dmitry Shachnev <[email protected]>
Date:   Thu Jul 10 14:31:56 2025 +0000

    Merge c0054b62971b855b4c0590f308e218feec0ccd22 into 9fb4c01fdc355e9deeed0ff2b4671d3faf3eb239

I'm not sure if this address always has just the GitHub user name before the @, or if it might be a different email address for the user (if it could be, it isn't useful to us).

I'll try to reset the branch to a sensible state.

mitya57 · 2025-07-12T15:09:10Z

> I'd hoped it'd be in an environment variable or there would be a remote pointing at it, but there doesn't seem to be.

I agree, that is unfortunate. Maybe it's worth filing a bug against GitHub itself about it?

For now, I rebased my branch and pushed myself, so the tests should hopefully pass now.

mitya57 mentioned this pull request Jun 15, 2025

Improved Polish stemmer from #159 #220

Closed

mitya57 force-pushed the polish-stemmer-3 branch from 7feb805 to fb0887a (June 15, 2025 19:36)

mitya57 force-pushed the polish-stemmer-3 branch from fb0887a to 6c2d03a (June 16, 2025 15:53)

ojwb reviewed algorithms/polish.sbl (two reviews, Jun 16, 2025)

ojwb reviewed algorithms/polish.sbl (Jun 17, 2025)

ojwb force-pushed the polish-stemmer-3 branch 3 times, most recently from 3402912 to c0054b6 (July 10, 2025 14:31)

ojwb force-pushed the polish-stemmer-3 branch from c0054b6 to a2eb25d (July 10, 2025 16:56)

mitya57 and others added 8 commits July 12, 2025 18:07

Add Polish stemmer

e4120ea

Apply 2 character limit at the top level

a819e4f

This avoids producing an empty stem for input `cie`, avoids reducing `liście` to `ł` and avoids removing acute accents from inputs `ć`, `ń`, `ś` and `ź`. It is also a bit more efficient (0.7% faster to stem the sample vocabulary in C).

Merge short_verb into verb

89ce8df

About 8% faster

Handle verbs and adjectives in one big among

bc34964

~19% time reduction from branch point for stemming sample vocab list in C.

One big among for all endings

21d2c4c

~26% time reduction.

polish: Improve handling of -shą

4d218dc

Co-authored-by: Olly Betts <[email protected]>

polish: Minor cleanups

18b749d

polish: Remove unused routine declaration

6429800

mitya57 force-pushed the polish-stemmer-3 branch from a2eb25d to 6429800 (July 12, 2025 15:07)
