saeiddrv

No description provided.

@saeiddrv
Author

add Persian stemming algorithm!

@ojwb
Member

ojwb commented Feb 26, 2024

We also need test data and a description for a new algorithm - there's a detailed description of the requirements in CONTRIBUTING.rst.

@saeiddrv
Author

I have fixed the issues.

)
)

define stem as (
Exception
( Exception )
Member

This means the same with or without the parentheses (just noting as the commit that changed this was "fix exception section"...)

@ojwb
Member

ojwb commented Mar 11, 2025

> We also need test data and a description for a new algorithm - there's a detailed description of the requirements in CONTRIBUTING.rst.

@saeiddrv You've since opened snowballstem/snowball-data#25 to add test data - thanks for doing that.

We really do still need a description of the new algorithm though, which should be a pull request against the https://github.com/snowballstem/snowball-website repo.

We've found that algorithms without good documentation are much harder to maintain. For example, if somebody reports that a particular stemmer doesn't remove a prefix or suffix, it's usually a lot of work to determine whether that's an oversight or something that was considered but rejected for a good reason (most languages seem to have a prefix or suffix that can't be safely removed without causing worse problems in words that start or end with the same letters but where they aren't the intended prefix/suffix). If we have a document that says "suffixes A, B and C are not removed because of X and Y" then we can just point the bug reporter to that.

If the implementation is from an algorithm that's described in a published paper we may be able to link to that to provide most of the information, provided the paper does contain a sufficiently detailed description of the algorithm and discussion of the design decisions that led to it.

I can help with proof-reading and the like, but I didn't develop the algorithm and I don't know the Persian language so it isn't a document I could write the actual content of.

https://snowballstem.org/algorithms/indonesian/stemmer.html is a reasonable example showing the sort of thing that's useful. There are macros to help render the list of references, the table of example stems, and the snowball code - see algorithms/indonesian/stemmer.tt in the repo.

@saeiddrv
Author

@ojwb Thanks for your review, guidance, and patience! I will fix those issues and work on making it more accurate in my next pushes.

@ojwb
Member

ojwb commented May 3, 2025

Hmm, it seems you've made updates to fix my comments, but GitHub still shows my comment last so I'd missed this until I noticed the "Outdated" marker on my comment.

I'm currently focusing on getting a new release out (it's very nearly 3.5 years since the last one, which is too long, and there's a huge pile of unreleased useful changes), and my plan is to then deal with the pending new algorithms (including this one) and make another release with those.

@ojwb ojwb added this to the 3.1.0 milestone May 5, 2025
'{mim}{nun}{dal}' (delete) // mand
'{nun}{alef}{kaf}' (delete) // naak
'{gaf}{ye}' (delete) // gii
'{alef}{nun}{heh}' (delete) // aaneh
Member

If an input word is exactly one of these suffixes, then the output stem is an empty string which seems better avoided even if none of these are actually valid words.

Checking output.txt from the snowball-data PR, there appear to be 155 entries in voc.txt which give an empty stem, at least some of which seem to be real words - e.g. بهانه‌هایی which google translate says means "excuses". We definitely don't want to stem real words to an empty stem.

If we should require leaving at least one character before the removed suffix, that can be easily (and efficiently) achieved by adding a check that the cursor isn't "atlimit" after the substring above like so:

[substring] not atlimit among (

If we should require leaving multiple characters then $(len > X) checks as used elsewhere are OK, if a bit inelegant (and less efficient for UTF-8 where the string length needs computing each time).
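
To make the effect of the proposed guard concrete, here is a minimal Python sketch (not Snowball, and not the PR's code) of suffix removal with and without a minimum-stem requirement. The suffix strings are the ASCII transliterations from the comments in the diff, used as stand-ins for the Persian letters. `strip_suffix` and `min_stem` are hypothetical names, and `honarmand` is only an illustrative input.

```python
# Stand-ins for the Persian suffixes: the ASCII transliterations
# given in the diff comments (mand, naak, gii, aaneh).
SUFFIXES = ("mand", "naak", "gii", "aaneh")


def strip_suffix(word, min_stem=0):
    """Remove the longest matching suffix from `word`.

    The suffix is only removed if at least `min_stem` characters
    would remain. min_stem=0 mimics the unguarded behaviour;
    min_stem=1 mimics adding `not atlimit` after `[substring]`.
    """
    # Like Snowball's `among`, consider only the longest matching suffix.
    matches = [s for s in SUFFIXES if word.endswith(s)]
    if not matches:
        return word
    suffix = max(matches, key=len)
    stem = word[: len(word) - len(suffix)]
    if len(stem) < min_stem:
        return word  # guard failed: leave the word unchanged
    return stem


if __name__ == "__main__":
    print(repr(strip_suffix("mand")))                   # unguarded
    print(repr(strip_suffix("mand", min_stem=1)))       # guarded
    print(repr(strip_suffix("honarmand", min_stem=1)))  # suffix removed
```

Requiring more than one remaining character corresponds to a larger `min_stem`, which is roughly what the `$(len > X)` checks achieve.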

Member

https://en.wikipedia.org/wiki/Persian_alphabet says there are "32 letters [in] the modern Persian alphabet" - that means there are only 32 possible different single letter stems, so producing a single letter stem for a longer input (particularly for real words) is probably also a design error because it's very likely to result in unwanted conflation.
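
One way to find such cases is to scan the vocabulary and the stemmer output side by side. The Python sketch below assumes the two lists are parallel (one word per line in voc.txt, the corresponding stem on the same line of output.txt). `short_stems` and `read_lines` are hypothetical helper names and the sample data is made up.

```python
def short_stems(words, stems, max_len=1):
    """Return (word, stem) pairs where the stem has at most `max_len`
    characters and is shorter than the input word.
    """
    return [
        (word, stem)
        for word, stem in zip(words, stems)
        if len(stem) <= max_len and len(stem) < len(word)
    ]


def read_lines(path):
    """Read a UTF-8 file into a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


if __name__ == "__main__":
    # Made-up placeholder data; with real files one could call
    # short_stems(read_lines("voc.txt"), read_lines("output.txt")).
    words = ["abc", "a", "abcd", "xyz"]
    stems = ["", "a", "a", "xy"]
    for word, stem in short_stems(words, stems):
        print(f"{word!r} -> {stem!r}")
```

Note that Python's `len` counts code points, so a zero-width non-joiner inside a word counts as a character here.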

'{mim}' ($(len > 3) delete)
'{ye}{dal}' ($(len > 3) delete)
'{alef}{ye}{dal}' ($(len > 3) delete)
'{alef}{nun}{dal}' ($(len > 3) delete)
Member

Here we're checking $(len > 3) for every suffix, but the suffixes vary from 1 to 3 characters. Elsewhere the number we check the length against seems to vary with the length of the prefix/suffix (so in Remove_Suffix_Nouns it's the length of the suffix + 1) which effectively means we're enforcing a minimum length on the stem to leave after removing the suffix, which is much more common in stemming algorithms.

Maybe the fixed length check here is OK, but I thought I should flag it up in case it's a mistake.
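
The difference between the two styles of check can be worked out with a few lines of Python. This is only an arithmetic sketch, assuming `len` counts characters; `min_stem_fixed` and `min_stem_relative` are hypothetical helper names.

```python
def min_stem_fixed(suffix_len, threshold=3):
    """Shortest stem that can remain when the guard is `len > threshold`."""
    shortest_word = threshold + 1  # smallest length satisfying len > threshold
    return shortest_word - suffix_len


def min_stem_relative(suffix_len, extra=1):
    """Shortest stem that can remain when the guard is `len > suffix_len + extra`."""
    shortest_word = suffix_len + extra + 1
    return shortest_word - suffix_len


if __name__ == "__main__":
    # Columns: suffix length, minimum stem (fixed check), minimum stem (relative check)
    for suffix_len in (1, 2, 3):
        print(suffix_len, min_stem_fixed(suffix_len), min_stem_relative(suffix_len))
```

Under the fixed `len > 3` check a three-character suffix can leave a single-character stem, while the suffix-relative check always leaves at least two characters.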

@saeiddrv
Author

saeiddrv commented Aug 8, 2025

Hi @ojwb, sorry for the delay, but I found an interesting approach for stemming the Persian language and have tried to implement it. It works much better now, and the errors with words like "بهانه‌هایی" have been resolved.
Thanks in advance for reviewing.
