create persian.sbl #194

saeiddrv · 2024-02-21T22:16:11Z

No description provided.

saeiddrv · 2024-02-21T22:18:27Z

add Persian stemming algorithm!

ojwb · 2024-02-26T18:22:49Z

We also need test data and a description for a new algorithm - there's a detailed description of the requirements in CONTRIBUTING.rst.

python/modules.txt

algorithms/persian.sbl

saeiddrv · 2024-02-27T12:10:34Z

I have fixed the issues.

algorithms/persian.sbl

ojwb · 2024-02-27T21:31:01Z

algorithms/persian.sbl

    )
 )

 define stem as (
-    Exception
+    ( Exception )


This means the same with or without the parentheses (just noting as the commit that changed this was "fix exception section"...)

libstemmer/modules.txt

ojwb · 2025-03-11T04:12:50Z

> We also need test data and a description for a new algorithm - there's a detailed description of the requirements in CONTRIBUTING.rst.

@saeiddrv You've since opened snowballstem/snowball-data#25 to add test data - thanks for doing that.

We really do still need a description of the new algorithm though, which should be a pull request against the https://github.com/snowballstem/snowball-website repo.

We've found that algorithms without good documentation are much harder to maintain. For example, if somebody reports that a particular stemmer doesn't remove a prefix or suffix, it's usually a lot of work to determine whether that's an oversight or something that was considered but rejected for a good reason (most languages seem to have a prefix or suffix that can't be safely removed without causing worse problems in words that start or end with the same letters but where they aren't the intended prefix/suffix). If we have a document that says "suffixes A, B and C are not removed because of X and Y" then we can just point the bug reporter to that.

If the implementation is from an algorithm that's described in a published paper we may be able to link to that to provide most of the information, provided the paper does contain a sufficiently detailed description of the algorithm and discussion of the design decisions that led to it.

I can help with proof-reading and the like, but I didn't develop the algorithm and I don't know the Persian language so it isn't a document I could write the actual content of.

https://snowballstem.org/algorithms/indonesian/stemmer.html is a reasonable example showing the sort of thing that's useful. There are macros to help render the list of references, the table of example stems, and the snowball code - see algorithms/indonesian/stemmer.tt in the repo.

algorithms/persian.sbl

saeiddrv · 2025-03-11T20:14:53Z

@ojwb Thanks for your review, guidance, and patience! I will fix those issues and work on making it more accurate in my next pushes.

algorithms/persian.sbl

ojwb · 2025-05-03T22:45:57Z

Hmm, it seems you've made updates to fix my comments, but GitHub still shows my comment last so I'd missed this until I noticed the "Outdated" marker on my comment.

I'm currently focusing on getting a new release out (it's very nearly 3.5 years since the last one, which is too long, and there's a huge pile of unreleased useful changes). My plan is then to deal with the pending new algorithms (including this one) and make another release with those.

algorithms/persian.sbl

ojwb · 2025-05-28T04:04:23Z

algorithms/persian.sbl

+            '{mim}{nun}{dal}' (delete)    // mand
+            '{nun}{alef}{kaf}' (delete)   // naak
+            '{gaf}{ye}' (delete)          // gii
+            '{alef}{nun}{heh}' (delete)   // aaneh


If an input word is exactly one of these suffixes, then the output stem is an empty string which seems better avoided even if none of these are actually valid words.

Checking output.txt from the snowball-data PR, there appear to be 155 entries in voc.txt which give an empty stem, at least some of which seem to be real words - e.g. بهانه‌هایی which Google Translate says means "excuses". We definitely don't want to stem real words to an empty stem.
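For anyone wanting to repeat that check, here is a minimal Python sketch. It assumes voc.txt and output.txt are line-aligned (line i of output.txt is the stem of line i of voc.txt); the function names and the file paths in the usage note are hypothetical, not part of the snowball-data tooling.

```python
def read_lines(path):
    """Read a UTF-8 file into a list of lines, keeping empty lines.

    An empty stem shows up as an empty line in output.txt, so empty
    lines must not be dropped here.
    """
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def find_empty_stems(words, stems):
    """Return the input words whose corresponding stem is the empty string.

    `words` and `stems` are parallel sequences (assumed line-aligned).
    """
    if len(words) != len(stems):
        raise ValueError("word list and stem list differ in length")
    return [word for word, stem in zip(words, stems) if stem == ""]
```

Usage would be along the lines of `find_empty_stems(read_lines("voc.txt"), read_lines("output.txt"))`, with the paths adjusted to wherever the test data lives.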

If we should require leaving at least one character before the removed suffix, that can be easily (and efficiently) achieved by adding a check that the cursor isn't "atlimit" after the substring above like so:

    [substring] not atlimit among (

If we should require leaving multiple characters then $(len > X) checks as used elsewhere are OK, if a bit inelegant (and less efficient for UTF-8 where the string length needs computing each time).
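To make the effect of the two guards concrete, here is a small Python model of the suffix step (a sketch only, not the Snowball code). The code points chosen for the letters and the `strip_suffix` helper are assumptions for illustration; `min_stem=0` models the unguarded version, `min_stem=1` models `not atlimit`, and larger values model a minimum-length check.

```python
# Assumed code points for the letters used in the suffixes quoted above.
MIM, NUN, DAL = "\u0645", "\u0646", "\u062f"
ALEF, KAF, GAF = "\u0627", "\u06a9", "\u06af"
YE, HEH = "\u06cc", "\u0647"

SUFFIXES = [MIM + NUN + DAL, NUN + ALEF + KAF, GAF + YE, ALEF + NUN + HEH]


def strip_suffix(word, suffixes=SUFFIXES, min_stem=1):
    """Remove the longest matching suffix if at least `min_stem`
    characters would remain; otherwise return the word unchanged."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix):
            if len(word) - len(suffix) >= min_stem:
                return word[: -len(suffix)]
            # Longest match found but the guard failed: leave the word alone
            # rather than falling back to a shorter suffix.
            return word
    return word
```

With `min_stem=0`, a word that is exactly one of the suffixes stems to the empty string; with the default `min_stem=1` it is returned unchanged, while longer words still lose the suffix.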

https://en.wikipedia.org/wiki/Persian_alphabet says there are "32 letters [in] the modern Persian alphabet" - that means there are only 32 possible different single letter stems, so producing a single letter stem for a longer input (particularly for real words) is probably also a design error because it's very likely to result in unwanted conflation.
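One way to see how much conflation very short stems cause is to group the test vocabulary by stem. The Python sketch below does that for line-aligned word and stem lists; the function name and the `max_stem_len` parameter are hypothetical, and `len` here counts code points (so a zero-width non-joiner counts as a character).

```python
from collections import defaultdict


def short_stem_groups(words, stems, max_stem_len=1):
    """Group longer input words by their stem, for stems of at most
    `max_stem_len` characters.

    Words that are already as short as their stem are skipped, since
    nothing was removed from them.
    """
    groups = defaultdict(list)
    for word, stem in zip(words, stems):
        if len(stem) <= max_stem_len and len(word) > len(stem):
            groups[stem].append(word)
    return dict(groups)
```

Any group with more than one unrelated word in it is a candidate for unwanted conflation and is worth checking by someone who knows the language.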

ojwb · 2025-05-28T04:12:37Z

algorithms/persian.sbl

+            '{mim}' ($(len > 3) delete)
+            '{ye}{dal}' ($(len > 3) delete)
+            '{alef}{ye}{dal}' ($(len > 3) delete)
+            '{alef}{nun}{dal}' ($(len > 3) delete)


Here we're checking $(len > 3) for every suffix, but the suffixes vary from 1 to 3 characters. Elsewhere the number we check the length against seems to vary with the length of the prefix/suffix (so in Remove_Suffix_Nouns it's the length of the suffix + 1) which effectively means we're enforcing a minimum length on the stem to leave after removing the suffix, which is much more common in stemming algorithms.

Maybe the fixed length check here is OK, but I thought I should flag it up in case it's a mistake.
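To spell out the difference, here is a short Python sketch of the arithmetic (the function names are hypothetical, and the "relative" rule is my reading of the other routines, i.e. a guard of the form `len > suffix length + 1`). With a fixed guard the shortest stem that can be left depends on which suffix matched; with a relative guard it is the same for every suffix.

```python
def min_stem_fixed(suffix_len, threshold=3):
    """Shortest stem that can remain when the guard is len(word) > threshold,
    regardless of which suffix matched."""
    return max(threshold + 1 - suffix_len, 0)


def min_stem_relative(suffix_len, extra=1):
    """Shortest stem that can remain when the guard is
    len(word) > suffix_len + extra."""
    return extra + 1


# Compare the two rules for suffixes of 1, 2 and 3 characters.
for n in (1, 2, 3):
    print(n, min_stem_fixed(n), min_stem_relative(n))
```

So under `$(len > 3)` a one-character suffix needs a stem of at least three characters to remain, while a three-character suffix can be removed leaving a single character.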

saeiddrv · 2025-08-08T09:49:34Z

Hi @ojwb, sorry for the delay, but I found an interesting approach for stemming the Persian language and have tried to implement it. It works much better now, and the errors with words like "بهانه‌هایی" have been resolved.
Thanks in advance for reviewing.

create persian.sbl

44057d0

ojwb reviewed Feb 26, 2024 (python/modules.txt, algorithms/persian.sbl)

Saeid Darvish added 5 commits February 26, 2024 20:06

fix errors

f8cfec9

add exception section

0248279

fix exception section

90c083c

add Suffix_Normalize

40f59b3

fix modules.txt

bfc5974

ojwb reviewed Feb 27, 2024 (algorithms/persian.sbl)

define arabic characters

27975e6

ojwb reviewed Mar 11, 2025 (libstemmer/modules.txt, algorithms/persian.sbl)

saeiddrv added 3 commits March 11, 2025 19:32

fix language code

be3c55d

fix using next statement

84ac8a4

add more exceptions

1645e2d

ojwb mentioned this pull request Mar 12, 2025

Farsi/Persian language support #181

Open

saeiddrv added 4 commits March 23, 2025 18:04

improve steps

bb9b69f

fix Normalize_Nouns

9aad969

fix Exceptions

24d0231

update header comment

a1e92cc

ojwb reviewed Mar 24, 2025 (algorithms/persian.sbl)

ojwb mentioned this pull request Apr 29, 2025

Improved Polish stemmer from #159 #220

Closed

ojwb reviewed May 3, 2025 (algorithms/persian.sbl)

ojwb added this to the 3.1.0 milestone May 5, 2025

Add explanatory comment for U200C half-space usage in Persian

f536e2e

ojwb reviewed May 28, 2025

update persian algorithm based on the HPS

e5e980d
