Protection against too long single word titles #1004

veloman-yunkan · 2025-09-14T16:33:43Z

Fixes #992

Xapian's parser discards words/terms longer than a certain limit (default: 64). However our workaround for handling titles that are made fully of non-word characters has created a loophole for words of arbitrary length to slip in. That leads to crashes if Xapian's
hard limit on the length of a term (max 245 bytes) is exceeded.

This PR closes the described loophole and validates the fix with a new unit test.

Note that an alternative fix is possible (indexing the truncated title instead of discarding it completel) but that would complicate the verification of the bug fix (because the potentially dangerous title would remain in the expected output of the unit test and the unit test would have to be converted into a less explicit crash/no-crash variant).

Xapian's parser (TermGenerator) discards words/terms longer than a certain limit (default: 64). However terms can be added to the indexed documents directly via Xapian::Document::add_term(). We do that in order to index titles that are made fully of non-word characters but our implementation opens up a loophole for words of arbitrary length to slip in (when the title is a single word). That leads to crashes if Xapian's hard limit on the length of a term (max 245 characters) is exceeded. The new unit tests demonstrates the existence of a loophole.

Titles consisting of a single word exceeding the max word length limit are not indexed (similarly to how that long word would be handled if it were not the only word in the title). An alternative might be to index a truncated version of the title but that would complicate the verification of the bug fix (because the potentially dangerous title would remain in the expected output of the unit test).

codecov · 2025-09-14T16:38:00Z

Codecov Report

❌ Patch coverage is 50.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.16%. Comparing base (abf8d4a) to head (e5b844b).

Files with missing lines	Patch %	Lines
src/writer/xapianIndexer.cpp	50.00%	0 Missing and 2 partials ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1004   +/-   ##
=======================================
  Coverage   58.15%   58.16%           
=======================================
  Files         100      100           
  Lines        5325     5328    +3     
  Branches     2179     2181    +2     
=======================================
+ Hits         3097     3099    +2     
  Misses        778      778           
- Partials     1450     1451    +1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

kelson42

@veloman-yunkan Thank you for tackling this issue. If I get it right your fix allows to ignore silently the entry requested to be indexed and moves forward.

If confirmed, this is not the right approach I believe. The correct approach IMO is to trigger an exception and then the software using the libzim should decide what to do with it:

Either ignore (like you have implemented at the low level)
Truncate (like @benoit74 did)
Stop and request the software user to rename the entry he wants to index

Doing so will allow, in addition, to the scraper developer/user to write/get a custom error message.

I understand that this brings complexity one level higher on the software stack, but I don't want us to just ignore things deep in the code and let the user (libzim user) believing that everything was OK.

benoit74

Some feedbacks:

completely ignoring too long words is probably going to be deceptive to our users ; at least the case we have encountered in mwoffliner is an WPAR (arabic) article with only one single very long arabic word as title ; if we ignore this title, then it means this article will not be searchable at all in suggestions, which I'm quite sure will be reported as a bug by users ; truncating seems more appealing and easy to understand for users from my PoV, and the reason you've mentioned (testing complexity) for not implementing this is not really convincing me
I don't see any explanation of why we stick with Xapian default 64 bytes limit while we could support 245 bytes ; 64 bytes is not that much especially for alphabets needing 2, 3 or even 4 bytes per character (and again in these languages we tend to often have single "words" titles, think e.g. about Chinese)
Tests miss cases where redirects titles are too long (quite important to check this works too should handling of titles become different for redirects)
It was explained in issue that this problem was caused by glass (default) Xapian backend ; I do not see any issue or code comment mentioning we can remove this limit on word length once we change Xapian backend, and I think it is mostly mandatory since this limit has not been pushed by any usability concern but only by a technical limitation

benoit74 · 2025-09-15T06:22:16Z

Just saw kelson42 comment, I'm quite aligned with the idea of pushing the problem to the libzim "user" with a clear exception. We should mention this somewhere in the documentation (not saying that users will easily find this documentation, but at least we will be able to point to this documentation whenever needed).

I like the fact that we do not take decision at a low level for a problem which might have different solutions depending on the context (truncating vs dropping vs altering too long word).

I'm a bit concerned about the fact that we will start to push technical Xapian limitations to the "user". It means that it will be harder to maintain in the long run, e.g. once Xapian technical changes, we will have to modify all scrapers. But I feel like it is an acceptable price to pay for flexibility.

veloman-yunkan · 2025-09-15T11:15:27Z

If confirmed, this is not the right approach I believe. The correct approach IMO is to trigger an exception and then the software using the libzim should decide what to do with it:

@kelson42 I can agree with such an approach but we won't be able to implement it (at least, as easily) for the case of titles consisting of two or more words. Then our implementation will be inconsistent - long words inside multiword titles will be silently ignored, whereas in single-word titles the user will be in control.

kelson42 · 2025-09-22T15:02:31Z

If confirmed, this is not the right approach I believe. The correct approach IMO is to trigger an exception and then the software using the libzim should decide what to do with it:

@kelson42 I can agree with such an approach but we won't be able to implement it (at least, as easily) for the case of titles consisting of two or more words. Then our implementation will be inconsistent - long words inside multiword titles will be silently ignored, whereas in single-word titles the user will be in control.

@veloman-yunkan I don't understand why " for the case of titles consisting of two or more words" they will be silently ignored. Sorry.

veloman-yunkan · 2025-09-22T15:57:32Z

If confirmed, this is not the right approach I believe. The correct approach IMO is to trigger an exception and then the software using the libzim should decide what to do with it:

@kelson42 I can agree with such an approach but we won't be able to implement it (at least, as easily) for the case of titles consisting of two or more words. Then our implementation will be inconsistent - long words inside multiword titles will be silently ignored, whereas in single-word titles the user will be in control.

@veloman-yunkan I don't understand why " for the case of titles consisting of two or more words" they will be silently ignored. Sorry.

We treat certain titles as a special case (see #765) and it's only in our handling of such titles that we can intervene. For titles failing to meet our special case criteria Xapian is fully in charge.

Revisiting what the special case is I see that it is more complex than a single-word title. The general formulation is "text that appears empty to the Xapian parser", which happens when the text is made up of exclusively the following types of words:

"pseudo-words" consisting of punctuation only
stopwords
words exceeding the max length limit

Thus a multi-word title all of the words in which are too long will also pass through our special handling, but dealing with it will be more difficult because we will have to parse it!

Having written all this I now realize that my fix works under the wrong assumption (and #765 is also affected by it).

kelson42 · 2025-09-22T17:14:44Z

Thank you for the explanation, but I still don't understand why "we won't be able to implement" a detection (in the special case treatment) of the words which are too long and trigger an exception when at least one of them is too long.

veloman-yunkan · 2025-09-25T14:34:58Z

We are able to implement anything within the current limits of the public knowledge and technology. But reimplementing parts of libzim's dependencies in order to fix some of their shortcomings is suboptimal. In this case we will have to tinker with parsing the text (something that Xapian did fully on its own).

kelson42 · 2025-09-26T04:20:15Z

@veloman-yunkan If you think that Xapian does not fully does the job, why we don't create an upstream issue then?

veloman-yunkan added 2 commits September 14, 2025 20:16

veloman-yunkan requested review from kelson42 and benoit74 September 14, 2025 16:34

veloman-yunkan mentioned this pull request Sep 14, 2025

Xapian issue when title is too "long" #992

Open

kelson42 reviewed Sep 15, 2025

View reviewed changes

benoit74 reviewed Sep 15, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Protection against too long single word titles #1004

Protection against too long single word titles #1004

Uh oh!

veloman-yunkan commented Sep 14, 2025

Uh oh!

codecov bot commented Sep 14, 2025 •

edited

Loading

Uh oh!

kelson42 left a comment •

edited

Loading

Uh oh!

benoit74 left a comment

Uh oh!

benoit74 commented Sep 15, 2025

Uh oh!

veloman-yunkan commented Sep 15, 2025

Uh oh!

kelson42 commented Sep 22, 2025

Uh oh!

veloman-yunkan commented Sep 22, 2025

Uh oh!

kelson42 commented Sep 22, 2025 •

edited

Loading

Uh oh!

veloman-yunkan commented Sep 25, 2025 •

edited

Loading

Uh oh!

kelson42 commented Sep 26, 2025 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Protection against too long single word titles #1004

Are you sure you want to change the base?

Protection against too long single word titles #1004

Uh oh!

Conversation

veloman-yunkan commented Sep 14, 2025

Uh oh!

codecov bot commented Sep 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

kelson42 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

benoit74 left a comment

Choose a reason for hiding this comment

Uh oh!

benoit74 commented Sep 15, 2025

Uh oh!

veloman-yunkan commented Sep 15, 2025

Uh oh!

kelson42 commented Sep 22, 2025

Uh oh!

veloman-yunkan commented Sep 22, 2025

Uh oh!

kelson42 commented Sep 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

veloman-yunkan commented Sep 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kelson42 commented Sep 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

codecov bot commented Sep 14, 2025 •

edited

Loading

kelson42 left a comment •

edited

Loading

kelson42 commented Sep 22, 2025 •

edited

Loading

veloman-yunkan commented Sep 25, 2025 •

edited

Loading

kelson42 commented Sep 26, 2025 •

edited

Loading