[WIP] Support for auto-completion and spelling correction in suggestions #994

veloman-yunkan · 2025-05-26T13:47:34Z

Fixes #731

codecov · 2025-05-26T13:52:04Z

Codecov Report

❌ Patch coverage is 58.99281% with 57 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.10%. Comparing base (5d00100) to head (2da49eb).

Files with missing lines	Patch %	Lines
src/suggestion.cpp	56.81%	6 Missing and 51 partials ⚠️

Additional details and impacted files

@@                   Coverage Diff                   @@
##           suggestions_cleanup     #994      +/-   ##
=======================================================
- Coverage                58.13%   58.10%   -0.03%     
=======================================================
  Files                      101      102       +1     
  Lines                     5384     5519     +135     
  Branches                  2197     2263      +66     
=======================================================
+ Hits                      3130     3207      +77     
- Misses                     795      804       +9     
- Partials                  1459     1508      +49

kelson42 · 2025-06-09T19:45:49Z

@veloman-yunkan I made a first quick run; here are my remarks:

- Spelling suggestions and "normal" suggestions are mixed on kiwix-serve, which makes it a bit difficult to understand immediately what is what.
- It seems the suggestions are not "really" clickable anymore (looks like a regression, but not dramatic since it is not the topic of this PR).
- In general it seems to work, but testing with the testset of @gremid, we have a few things difficult to understand, for example ("enttäuschen" should be in the list):
- In my first test, multiword spell checking does not work or is not convincing.
- We should also think about limiting the number of spelling corrections we propose... but how?

I keep thinking testing this on the command line with zim-search would be more efficient. But before any other consideration we have to make sure the solution works for @gremid's list, see https://github.com/gremid/xapian-spelling-suggestions/blob/main/testdata.csv

veloman-yunkan · 2025-06-12T12:10:36Z

> In general it seems to work, but testing with the testset of @gremid, we have a few things difficult to understand, for example ("enttäuschen" should be in the list):

It is in the list as enttauschen (without diacritics) because suggestions are based on indexed data from which diacritics have been removed.

veloman-yunkan · 2025-06-12T12:19:09Z

> It seems the suggestions are not "really" clickable anymore (looks like a regression, but not dramatic since it is not the topic of this PR).

@kelson42 What do you mean by that? It works for me. If a title suggestion is clicked the linked page is opened. Clicking on the completion or spelling suggestion changes the text in the search box.

veloman-yunkan · 2025-06-12T12:26:57Z

> I keep thinking testing this on the command line with zim-search would be more efficient.

Do you mean automated testing? Otherwise how can command line usage of suggestion functionality be more efficient compared to interactive instant feedback?

veloman-yunkan · 2025-06-12T12:28:08Z

> In my first test, multiword spell checking does not work or is not convincing.

In a multiword query only the last word is spell-checked.
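A minimal sketch of that behaviour, for illustration only and not the code of this PR: the query is split at its last space and only the trailing word is passed to a corrector. `correctLastWord` and the `correct` callback are hypothetical names introduced here.

```cpp
#include <functional>
#include <string>

// Applies `correct` (a hypothetical single-word corrector) to the last
// space-separated word of `query` and leaves the preceding words untouched.
std::string correctLastWord(
    const std::string& query,
    const std::function<std::string(const std::string&)>& correct) {
    const std::size_t lastSpace = query.find_last_of(' ');
    if (lastSpace == std::string::npos) {
        return correct(query);  // single-word query: the whole query is the last word
    }
    const std::string head = query.substr(0, lastSpace + 1);  // keeps the space
    const std::string lastWord = query.substr(lastSpace + 1);
    // A trailing space means there is no last word to check yet.
    return lastWord.empty() ? query : head + correct(lastWord);
}
```

With such a scheme a typo in any word but the last one is passed through unchanged, which would explain why multiword spell checking looks unconvincing when the misspelled word is not the final one.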

veloman-yunkan · 2025-06-12T13:09:34Z

@kelson42 I guess some of the "issues" that you've experienced with spelling corrections could be caused by the lag in spelling corrections on large ZIM files (with the title vocabulary exceeding a few hundred words) since the temporary spell-checker database was being created on every suggestion request. I have now committed an optimization so that the (still temporary in-memory) spellchecker database is cached in suggestion searcher.
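The caching idea can be sketched as follows. This is an assumption-laden illustration, not the committed code: `SpellingsDb` and `CachingSearcher` are hypothetical stand-ins, and the real database is a Xapian one rather than a vector of strings.

```cpp
#include <memory>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the temporary spell-checker database.
struct SpellingsDb {
    explicit SpellingsDb(std::vector<std::string> t) : terms(std::move(t)) {}
    std::vector<std::string> terms;
};

// Hypothetical searcher that builds the database once and then reuses it.
class CachingSearcher {
public:
    explicit CachingSearcher(std::vector<std::string> titleTerms)
        : titleTerms_(std::move(titleTerms)) {}

    // Builds the database on the first call; later calls return the cached one.
    const SpellingsDb& spellingsDb() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (!db_) {
            db_ = std::make_unique<SpellingsDb>(titleTerms_);
            ++buildCount_;
        }
        return *db_;
    }

    // Number of times the database was actually built (for demonstration).
    int buildCount() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return buildCount_;
    }

private:
    std::vector<std::string> titleTerms_;
    std::unique_ptr<SpellingsDb> db_;
    int buildCount_ = 0;
    mutable std::mutex mutex_;
};
```

The expensive build then happens once per searcher instead of once per suggestion request, at the cost of keeping the database in memory for the lifetime of the searcher.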

kelson42 · 2025-09-22T14:51:21Z

> > I keep thinking testing this on the command line with zim-search would be more efficient.
>
> Do you mean automated testing? Otherwise how can command line usage of suggestion functionality be more efficient compared to interactive instant feedback?

@veloman-yunkan No, I mean zimsearch: https://github.com/openzim/zim-tools/blob/main/src/zimsearch.cpp. I don't want to depend on a web server to test this. Can you implement the feature there so I can test without:

- results mixed with other results
- having to deal with a browser
- a complex software stack like kiwix-serve

kelson42 · 2025-09-22T14:54:43Z

> > In general it seems to work, but testing with the testset of @gremid, we have a few things difficult to understand, for example ("enttäuschen" should be in the list):
>
> It is in the list as enttauschen (without diacritics) because suggestions are based on indexed data from which diacritics have been removed.

We never display suggestions without diacritics. Indexing is done without diacritics, but what we display always has diacritics. So I don't understand the argument here.
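To make the distinction concrete, here is a rough sketch (not libzim code) of matching on a folded form while displaying the original spelling. `foldForIndex` and `buildDisplayForms` are hypothetical helpers, and the folding table covers only a handful of letters; a real implementation would rely on a Unicode library.

```cpp
#include <cctype>
#include <map>
#include <string>
#include <vector>

// Hypothetical folding helper: lowercases ASCII and strips the diacritics
// from a few UTF-8 encoded letters (illustrative table only).
std::string foldForIndex(const std::string& text) {
    static const std::map<std::string, std::string> kFolded = {
        {"\xC3\xA4", "a"}, {"\xC3\x84", "a"},  // a-umlaut, lower and upper case
        {"\xC3\xB6", "o"}, {"\xC3\x96", "o"},  // o-umlaut
        {"\xC3\xBC", "u"}, {"\xC3\x9C", "u"},  // u-umlaut
        {"\xC3\xA9", "e"}, {"\xC3\x89", "e"},  // e-acute
    };
    std::string out;
    for (std::size_t i = 0; i < text.size();) {
        const unsigned char c = static_cast<unsigned char>(text[i]);
        const auto it = (c < 0x80) ? kFolded.end() : kFolded.find(text.substr(i, 2));
        if (it != kFolded.end()) {
            out += it->second;  // known accented letter: emit its base letter
            i += 2;
        } else {
            // ASCII is lowercased; unknown bytes are copied unchanged.
            out += (c < 0x80) ? static_cast<char>(std::tolower(c)) : text[i];
            ++i;
        }
    }
    return out;
}

// Maps each folded term back to its original spelling, so that a match on
// the folded form can still be displayed with its diacritics.
std::map<std::string, std::string> buildDisplayForms(
    const std::vector<std::string>& titles) {
    std::map<std::string, std::string> displayForms;
    for (const auto& title : titles) {
        displayForms.emplace(foldForIndex(title), title);
    }
    return displayForms;
}
```

In this sketch a query folded to `enttauschen` would be looked up in the map and shown to the user as the stored original form. Whether the smart suggestions can recover the original form this way depends on what the title index actually stores, which this sketch does not settle.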

kelson42 · 2025-09-22T14:56:20Z

> @kelson42 I guess some of the "issues" that you've experienced with spelling corrections could be caused by the lag in spelling corrections on large ZIM files (with the title vocabulary exceeding a few hundred words) since the temporary spell-checker database was being created on every suggestion request. I have now committed an optimization so that the (still temporary in-memory) spellchecker database is cached in suggestion searcher.

This is an additional reason why we should implement this first in zimsearch.

This is just a starting point for beginning the TDD process of implementing autocompletion and spelling correction support in libzim.

But the stated autocompletion functionality is not yet implemented. SuggestionSearch::getAutocompletionResults() doesn't even get called contrary to expectations. Need to debug it.

It turned out that SuggestionSearch::getAutocompletionResults() stub introduced in the previous commit was actually being called. However instead of returning an empty result set it delivered all entries because of a bug with the usage of zim::Archive::Entry::offset() that was copied from SuggestionSearch::getResults(). That bug has to be fixed too.

This commit somewhat reverts an earlier commit "Added result count limit to SuggestionSearcher::suggest()", and introduces support for auto-completion suggestions (and a similar door to spelling correction suggestions) in a backward compatible way.

Implemented SuggestionSearch::getAutocompletionSuggestions() via fake implementation of some of the newly introduced helper code that only works for the sole unit test addressing the new functionality.

Added more titles to the Suggestion.autoCompletionAndSpellingCorrection unit-test so that it can be enhanced with more checks.

But still works only on unit-test data.

Renamed the unit-test `Suggestion.autoCompletionAndSpellingCorrection` to `Suggestion.smartSuggestions`. Also renamed the macro `EXPECT_SUGGESTION_RESULTS()` to `EXPECT_SMART_SUGGESTION_RESULTS`.

The new unit-test covers SuggestionSearch::getAutocompletionSuggestions() and demonstrates its various shortcomings by passing (rather than failing) due to usage of the results returned by the current implementation as the expected outcome.

Filtered out stemmed terms from the autocompletion database

Autocompletion query is now case & diacritics insensitive. Ignoring diacritics in the query during autocompletion is probably wrong. However, in the current implementation it cannot be easily fixed, since autocompletion works off of the terms recorded in the title index database, and diacritics info is removed during indexing.
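For illustration, a case-insensitive prefix completion over a sorted term list could look like the sketch below. It is not the implementation in this PR: `completeTerm` and `toLowerAscii` are hypothetical names, the vector stands in for the terms of the title index, and only ASCII case folding is handled. The two-character minimum mirrors the commit "Completions require a prefix of at least two letters".

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Lowercases ASCII letters only; diacritics handling is out of scope here.
std::string toLowerAscii(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// `sortedTerms` is assumed to be sorted, lowercased and diacritics-free,
// like the terms recorded in a title index.
std::vector<std::string> completeTerm(const std::vector<std::string>& sortedTerms,
                                      const std::string& typed,
                                      std::size_t maxResults) {
    std::vector<std::string> completions;
    const std::string prefix = toLowerAscii(typed);
    if (prefix.size() < 2) {
        return completions;  // prefix too short to complete (counted in bytes)
    }
    // All terms sharing the prefix form a contiguous run starting here.
    auto it = std::lower_bound(sortedTerms.begin(), sortedTerms.end(), prefix);
    for (; it != sortedTerms.end() && completions.size() < maxResults; ++it) {
        if (it->compare(0, prefix.size(), prefix) != 0) {
            break;  // left the run of terms that start with the prefix
        }
        completions.push_back(*it);
    }
    return completions;
}
```

Because the lookup runs against terms that have already lost their diacritics, the completions it returns are diacritics-free as well, which is the limitation described above.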

This was supposed to be a draft implementation but it doesn't work since support for spelling correction is not implemented in the InMemory backend of Xapian.

Made SpellingsDB work using a Glass Xapian database in an in-memory filesystem. This is only a temporary workaround for the proof-of-concept stage. The plan is to use a better spelling suggestion engine (nuspell) but that will require significant additional effort.
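As a rough picture of what a spelling suggestion engine has to do, the sketch below picks the vocabulary term with the smallest edit distance to the typed word. It is a self-contained stand-in written for this discussion: it uses neither Xapian nor nuspell, does not reproduce their algorithms, and `editDistance` and `suggestSpelling` are hypothetical names.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Levenshtein distance over bytes, computed with two rows of the DP table.
std::size_t editDistance(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) {
        prev[j] = j;
    }
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t substitution =
                prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, substitution});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

// Returns the vocabulary term closest to `word` within `maxDistance`, or an
// empty string when `word` is already known or nothing is close enough.
std::string suggestSpelling(const std::vector<std::string>& vocabulary,
                            const std::string& word,
                            std::size_t maxDistance) {
    std::string best;
    std::size_t bestDistance = maxDistance + 1;
    for (const auto& term : vocabulary) {
        const std::size_t d = editDistance(word, term);
        if (d == 0) {
            return std::string();  // correctly spelled, nothing to suggest
        }
        if (d < bestDistance) {
            bestDistance = d;
            best = term;
        }
    }
    return best;
}
```

A linear scan like this compares the word against every term, so it only serves to illustrate the expected input and output; a dedicated engine avoids that cost on large vocabularies.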

But that may result in duplicated suggestions as demonstrated by some test points in the Suggestion.smartSuggestions unit-test.

The results of obtaining all suggestion terms are memoized in SuggestionDataBase.

Made spellings DB a member of SuggestionsDataBase so that it is not recreated on every call of getSpellingCorrections().

veloman-yunkan mentioned this pull request May 30, 2025

[WIP] Smart suggestions (auto-completion, spelling correction) kiwix/libkiwix#1198

Draft

veloman-yunkan force-pushed the spelling_correction branch 3 times, most recently from e6e7d2c to 84ad890 Compare May 31, 2025 13:21

veloman-yunkan mentioned this pull request May 31, 2025

Offer search term spelling corrections #731

Open

gremid reviewed Jun 1, 2025

veloman-yunkan force-pushed the spelling_correction branch from bfdae6d to 5a90b35 Compare June 12, 2025 13:06

veloman-yunkan force-pushed the spelling_correction branch 2 times, most recently from 6df44d2 to 8d72969 Compare September 26, 2025 11:59

veloman-yunkan mentioned this pull request Sep 26, 2025

Cleanup of suggestions code #1006

Merged

veloman-yunkan changed the base branch from main to suggestions_cleanup September 26, 2025 12:04

veloman-yunkan added 9 commits September 26, 2025 16:39

New unittest Suggestion.autoCompletionAndSpellingCorrection

55da92e

This is just a starting point for beginning the TDD process of implementing autocompletion and spelling correction support in libzim.

Limit on suggestion count in EXPECT_SUGGESTION_RESULTS()

5d52188

Added result count limit to SuggestionSearcher::suggest()

391ca59

But the stated autocompletion functionality is not yet implemented. SuggestionSearch::getAutocompletionResults() doesn't even get called contrary to expectations. Need to debug it.

"Implemented" SuggestionSearch::getAutocompletionSuggestions()

1d683e6

Implemented SuggestionSearch::getAutocompletionSuggestions() via fake implementation of some of the newly introduced helper code that only works for the sole unit test addressing the new functionality.

Richer Suggestion.autoCompletionAndSpellingCorrection unit-test

bb70480

Added more titles to the Suggestion.autoCompletionAndSpellingCorrection unit-test so that it can be enhanced with more checks.

Autocompletion handles multiword queries

f9196d5

But still works only on unit-test data.

Implemented getTermCompletions()

3598328

veloman-yunkan added 10 commits September 26, 2025 16:39

Renamed a unit-test

30a463b

Renamed the unit-test `Suggestion.autoCompletionAndSpellingCorrection` to `Suggestion.smartSuggestions`. Also renamed the macro `EXPECT_SUGGESTION_RESULTS()` to `EXPECT_SMART_SUGGESTION_RESULTS`.

No more fake entries in autocompletion suggestions

a2d21e5

Filtered out stemmed terms from the autocompletion database

Completions require a prefix of at least two letters

f35b542

Enter SuggestionSearch::getSpellingSuggestions()

6cebde6

This was supposed to be a draft implementation but it doesn't work since support for spelling correction is not implemented in the InMemory backend of Xapian.

Spelling corrections are included in smart suggestions

a94111a

But that may result in duplicated suggestions as demonstrated by some test points in the Suggestion.smartSuggestions unit-test.

SuggestionDataBase::getAllSuggestionTerms()

966b2e5

The results of obtaining all suggestion terms are memoized in SuggestionDataBase.

SuggestionDataBase::getSpellingCorrections()

2da49eb

Made spellings DB a member of SuggestionsDataBase so that it is not recreated on every call of getSpellingCorrections().

veloman-yunkan force-pushed the suggestions_cleanup branch from cf894d8 to 5d00100 Compare September 26, 2025 12:40

veloman-yunkan force-pushed the spelling_correction branch from 8d72969 to 2da49eb Compare September 26, 2025 12:40

veloman-yunkan mentioned this pull request Sep 26, 2025

[WIP] Limited version of spelling correction #1007

Draft

kelson42 linked an issue Sep 28, 2025 that may be closed by this pull request

Offer search term spelling corrections #731

Open

Base automatically changed from suggestions_cleanup to main September 29, 2025 12:35
