Skip to content

Conversation

veloman-yunkan
Copy link
Collaborator

@veloman-yunkan veloman-yunkan commented May 26, 2025

Fixes #731

Copy link

codecov bot commented May 26, 2025

Codecov Report

❌ Patch coverage is 58.99281% with 57 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.10%. Comparing base (5d00100) to head (2da49eb).

Files with missing lines Patch % Lines
src/suggestion.cpp 56.81% 6 Missing and 51 partials ⚠️
Additional details and impacted files
@@                   Coverage Diff                   @@
##           suggestions_cleanup     #994      +/-   ##
=======================================================
- Coverage                58.13%   58.10%   -0.03%     
=======================================================
  Files                      101      102       +1     
  Lines                     5384     5519     +135     
  Branches                  2197     2263      +66     
=======================================================
+ Hits                      3130     3207      +77     
- Misses                     795      804       +9     
- Partials                  1459     1508      +49     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@kelson42
Copy link
Contributor

kelson42 commented Jun 9, 2025

@veloman-yunkan I made a first quick run, here my remarks:

  • Spelling suggestions and "normal" suggestions are mixed on kiwix-serve which makes a bit difficult immediately to understand what is what.
  • It seems the suggestions are not "really" clickable anymore (looks like a regressions, but not dramatic because not the topic of the PR)
  • In general it seems to work, but testing with the testset of @gremid, we have a few things difficult to understand, for example ("enttäuschen" should be in the list):
    image
  • To me and my first test, multiwords spell checking does not work or is not convincing
  • We should think also to "limit" the limit of spellcheck corrections we propose... but how?

I keep thinking testing this on the command line with zim-search would be more efficient. But we have to secure the solution works for the list of @gremid before any other consideration, see https://github.com/gremid/xapian-spelling-suggestions/blob/main/testdata.csv

@veloman-yunkan
Copy link
Collaborator Author

  • In general it seems to work, but testing with the testset of @gremid, we have a few things difficult to understand, for example ("enttäuschen" should be in the list):
    image

It is in the list as enttauschen (without diacritics) because suggestions are based on indexed data from which diacritics have been removed.

@veloman-yunkan
Copy link
Collaborator Author

  • It seems the suggestions are not "really" clickable anymore (looks like a regressions, but not dramatic because not the topic of the PR)

@kelson42 What do you mean by that? It works for me. If a title suggestion is clicked the linked page is opened. Clicking on the completion or spelling suggestion changes the text in the search box.

@veloman-yunkan
Copy link
Collaborator Author

I keep thinking testing this on the command line with zim-search would be more efficient.

Do you mean automated testing? Otherwise how can command line usage of suggestion functionality be more efficient compared to interactive instant feedback?

@veloman-yunkan
Copy link
Collaborator Author

  • To me and my first test, multiwords spell checking does not work or is not convincing

In a multiword query only the last word is spell-checked.

@veloman-yunkan
Copy link
Collaborator Author

@kelson42 I guess some of the "issues" that you've experienced with spelling corrections could be caused by the lag in spelling corrections on large ZIM files (with the title vocabulary exceeding a few hundred words) since the temporary spell-checker database was being created on every suggestion request. I have now committed an optimization so that the (still temporary in-memory) spellchecker database is cached in suggestion searcher.

@kelson42
Copy link
Contributor

I keep thinking testing this on the command line with zim-search would be more efficient.

Do you mean automated testing? Otherwise how can command line usage of suggestion functionality be more efficient compared to interactive instant feedback?

@veloman-yunkan No, I mean zimsearch. https://github.com/openzim/zim-tools/blob/main/src/zimsearch.cpp. I don't want to depend on a web server to test this. Can you implement the feature there so I can test without having:

  • Results mixed with other results
  • Having to deal with a browser
  • A complex software stack like kiwix-serve

@kelson42
Copy link
Contributor

  • In general it seems to work, but testing with the testset of @gremid, we have a few things difficult to understand, for example ("enttäuschen" should be in the list):
    image

It is in the list as enttauschen (without diacritics) because suggestions are based on indexed data from which diacritics have been removed.

We never display suggestions without diacritics. The indexation is made without diacritics, but was we display is always with diacritics. So I don't understand the argumentation here.

@kelson42
Copy link
Contributor

@kelson42 I guess some of the "issues" that you've experienced with spelling corrections could be caused by the lag in spelling corrections on large ZIM files (with the title vocabulary exceeding a few hundred words) since the temporary spell-checker database was being created on every suggestion request. I have now committed an optimization so that the (still temporary in-memory) spellchecker database is cached in suggestion searcher.

This is an additional reason why we should implement this first in zimsearch.

@veloman-yunkan veloman-yunkan force-pushed the spelling_correction branch 2 times, most recently from 6df44d2 to 8d72969 Compare September 26, 2025 11:59
@veloman-yunkan veloman-yunkan changed the base branch from main to suggestions_cleanup September 26, 2025 12:04
This is just a starting point for beginning the TDD process of
implementing autocompletion and spelling correction support in libzim.
But the stated autocompletion functionality is not yet implemented.

SuggestionSearch::getAutocompletionResults() doesn't even get called
contrary to expectations. Need to debug it.
It turned out that SuggestionSearch::getAutocompletionResults() stub
introduced in the previous commit was actually being called. However
instead of returning an empty result set it delivered all entries
because of a bug with the usage of zim::Archive::Entry::offset() that
was copied from SuggestionSearch::getResults(). That bug has to be fixed
too.
This commit somewhat reverts an earlier commit "Added result count limit
to SuggestionSearcher::suggest()", and introduces support for
auto-completion suggestions (and a similar door to spelling correction
suggestions) in a backward compatible way.
Implemented SuggestionSearch::getAutocompletionSuggestions() via fake
implementation of some of the newly introduced helper code that only
works for the sole unit test addressing the new functionality.
Added more titles to the Suggestion.autoCompletionAndSpellingCorrection
unit-test so that it can be enhanced with more checks.
But still works only on unit-test data.
Renamed the unit-test `Suggestion.autoCompletionAndSpellingCorrection`
to `Suggestion.smartSuggestions`.

Also renamed the macro `EXPECT_SUGGESTION_RESULTS()` to
`EXPECT_SMART_SUGGESTION_RESULTS`.
The new unit-test covers
SuggestionSearch::getAutocompletionSuggestions() and demonstrates its
various shortcomings by passing (rather than failing) due to usage of
the results returned by the current implementation as the expected
outcome.
Filtered out stemmed terms from the autocompletion database
Autocompletion query is now case & diacritics insensitive. Ignoring
diacritics in the query during autocompletion is probably wrong.
However, in the current implementation it cannot be easily fixed, since
autocompletion works off of the terms recorded in the title index
database, and diacritics info is removed during indexing.
This was supposed to be a draft implementation but it doesn't work since
support for spelling correction is not implemented in the InMemory
backend of Xapian.
Made SpellingsDB work using a Glass Xapian database in an in-memory
filesystem. This is only a temporary workaround for the proof-of-concept
stage. The plan is to use a better spelling suggestion engine (nuspell)
but that will require significant additional effort.
But that may result in duplicated suggestions as demonstrated by some
test points in the Suggestion.smartSuggestions unit-test.
The results of obtaining all suggestion terms are memoized in
SuggestionDataBase.
Made spellings DB a member of SuggestionsDataBase so that it is not
recreated on every call of getSpellingCorrections().
@kelson42 kelson42 linked an issue Sep 28, 2025 that may be closed by this pull request
Base automatically changed from suggestions_cleanup to main September 29, 2025 12:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Offer search term spelling corrections
3 participants