Skip to content

Conversation

mgautierfr
Copy link
Contributor

@mgautierfr mgautierfr commented Feb 28, 2025

Fix #617 (preloading xapian database)

Numbers do not show a strong win (not a lost neither) but this PR also introduce nice improvement as side effect and so is a good preparatory work for next project about multi zim and multi lang issues.

Globally this PR does:

  • Move xapian metadata out of internalDataBase
  • Separate xapian database (and metadata) from internalDataBase. XapianDb is about a specific database stored in a zim file. InternalDataBase is about a database we search in (which "include" XapianDbs)
  • Move opening of xapian database in fileImpl (on the archive side instead of on the search side)
  • Open only on xapian db per archive. This as the side effect that searching is now thread safe. (We introduce mutex lock system to do so)
  • We introduce a new public method Archive::preloadXapianDb to explicitly load the xapian db.

The numbers

Globally, what is taking time is loading the data from fs into memory. And this is the purpose of this PR to load this data upfront.
So there is two use case to evaluate: a cold one (data is not loading) and warm one (data is loaded).
On top of that, locating the xapian db or using the search results imply loading entries from zim file to get the title. Accessing zim dirent will activate our dirent lookup cache and we will need to populate it (and we know it can take some time, that was the purpose of #950 to be able to deactivate it)

Numbers are obtained by instrumenting zimsearch and running zimsearch wikispecies_en_all_maxi_2020-06.zim home (and taking the mean of 3 runs).
Numbers are in milliseconds. Main number is cold. Number is parentheses is warm.
Search (w getTitle) is the time to do the search and use it. It include the time to get the title (from zim dirents) of all results (getTitle). Search (wo getTitle) is simply the difference between both, this is the real time taken by xapian to do the search.

Opening no preload preload  
no cache 13 (1) 13 (1)
cache 13 (1) 13 (1)
Preloading no preload preload  
no cache 0 2 (0.6)
cache 0 155 (8)
Create search no preload preload  
no cache 2 (0.7) 0.1 (0.1)
cache 170 (10) 0.1 (0.1)
Search (w getTitle) no preload preload  
no cache 224 (21) 250 (20)
cache 147 (19) 107 (16)
getTitle no preload preload  
no cache 209 (9) 235 (9)
cache 132 (9) 94 (8)
Search (wo getTitle) no preload preload  
no cache 15 (11) 14 (11)
cache 15 (10) 13 (8)
Total no preload preload  
no cache 240 (22) 265 (22)
cache 330 (31) 276 (25)

From Preloading table, we can see that loading the xapian DB itself is pretty fast, only 2ms.
When dirent lookup is activated, it take 155ms but this is mostly the time to populate the dirent lookup cache. We found those numbers in Create search (no preload) as we load the xapian db (and populate the dirent lookup cache) when we start the search.

The search performance itself is not really changed. Warm run is a bit faster but not a game changer.

Exploiting the search result (getTitle) also shows mitigated results:

  • dirent lookup improve dirent search
  • preload of xapian db has different impact depending of dirent lookup activated or not. However, it should have no impact at all.

For the Total, preloading and dirent lookup cache also shows strange results (preloading should not impact the total, dirent lookup cache should have the same impact, what ever preloading is)

At the end, it seem that most of the time spend during a search is about loading the dirent from zim file (either at dirent lookup population or at get title), not from xapian.
As said in the introduction, this PR is still interresting for its side effects:

  • Thread safe search
  • Only one xapian db open per zim file

@benoit74
Copy link

You did not mentioned it, but did you ran benchmarking (The numbers) on your beefy dev machine?

I think it would be important to make same measurements on a Raspberry Pi 3 (or even a Raspberry Pi Zero W) with the ZIM on an SD card. I imagine numbers will be significantly different, and the benefit of preloading Xapian database could be more visible since this is kinda the "worst case scenario" (at least it is the scenario of original issue where we had bad numbers in Cold vs Hot cases). Maybe testing on a real Android device (e.g. low-end tablet) would be fun as well (or not ^^ I imagine you don't have sufficient tooling for this just like I do).

@mgautierfr mgautierfr force-pushed the xapian_preload branch 2 times, most recently from cea50cf to d2e4ae7 Compare February 28, 2025 11:44
@mgautierfr
Copy link
Contributor Author

Indeed. I have a pi3, I will run the same test on it. And I confirm I have no android device, I can send you the binaries if you have one and want to test.

@benoit74
Copy link

I can send you the binaries if you have one and want to test

I have no idea about how to run a binary on an Android device. Is it just a matter of starting an adb shell to the device and starting the binary? If so (or something straightforward like this), then yes I can happily help with this.

@mgautierfr
Copy link
Contributor Author

mgautierfr commented Feb 28, 2025

Here the numbers on a Pi3:

Opening no preload preload  
no cache 2806 (299) 2800 (299)
cache 2866 (299) 2810 (298)
Preloading no preload preload  
no cache 0 131 (4)
cache 0 1206 (25)
Create search no preload preload  
no cache 145 (4) 6 (0.1)
cache 1229 (25) 6 (0.1)
Search (w getTitle) no preload preload  
no cache 7801 (368) 7754 (368)
cache 6602 (352) 6549 (364)
getTitle no preload preload  
no cache 7015 (176) 6972 (176)
cache 5597 (161) 5578 (172)
Search (wo getTitle) no preload preload  
no cache 787 (191) 781 (191)
cache 1005 (191) 880 (192)
Total no preload preload  
no cache 10752 (672) 10693 (671)
cache 10698 (676) 10482 (688)

While the numbers are obviously greater, the same schema appears. Opening the xapian db is about 131ms, searching (xapian only) is about 850ms. So we spend around 1s in xapian and 9s in zim itself.

@mgautierfr
Copy link
Contributor Author

Is it just a matter of starting an adb shell to the device and starting the binary ?

Yes, it should be as simple as it sound.

@benoit74
Copy link

benoit74 commented Mar 2, 2025

Yes, it should be as simple as it sound.

Please send me the binary and the ZIM file then, I will give it a try (mostly for curiosity, since numbers on Pi3 seems to speak for themselves)

@mgautierfr
Copy link
Contributor Author

Please send me the binary

I cannot because of this : kiwix/kiwix-build#808

@kelson42
Copy link
Contributor

kelson42 commented Mar 3, 2025

@veloman-yunkan Can you please do the necessary to fix kiwix-build?

@veloman-yunkan
Copy link
Collaborator

@veloman-yunkan Can you please do the necessary to fix kiwix-build?

@kelson42 I will

@kelson42
Copy link
Contributor

kelson42 commented Mar 4, 2025

@mgautierfr Does the documentation (readthedocs) has been changed? It is not very clear to me right now how to use this feature and what has changed in term of interface for the developer.

@mgautierfr
Copy link
Contributor Author

mgautierfr commented Mar 4, 2025

Does the documentation (readthedocs) has been changed?

It will be once merged. readthedocs documentation is generated from comments in source code. You can see the rtd for the pr at : https://libzim--958.org.readthedocs.build/en/958/
Once merged, it will become the "main" doc.

Saying that, I've just realize that the doc is generated as if libzim was not compiled with xapian. So some methods are not present (especially Archive::preloadXapianDb here). But it is an issue (#959) to handle separately from this PR.

It is not very clear to me right now how to use this feature and what has changed in term of interface for the developer.

User have to call Archive::preloadXapianDb to ... preload the xapian db.
Other change is that now search is thread safe so user don't have to protect search related calls with its own mutex. It is a relax on a usage constraints but nothing changed on the API itself.

@kelson42
Copy link
Contributor

kelson42 commented Mar 4, 2025

@mgautierfr From what I see there, the preloading of the Xapian indexes is not and option to specify at ZIM archive.

@mgautierfr
Copy link
Contributor Author

Yes, because of this :

Saying that, I've just realize that the doc is generated as if libzim was not compiled with xapian. So some methods are not present (especially Archive::preloadXapianDb here). But it is an issue (#959) to handle separately from this PR.

@veloman-yunkan
Copy link
Collaborator

@veloman-yunkan Can you please do the necessary to fix kiwix-build?

@kelson42 I will

Done

@kelson42
Copy link
Contributor

kelson42 commented Mar 5, 2025

@mgautierfr preloading xapian index has to be an option of the opening primitives, not a dedicated public API.

@mgautierfr
Copy link
Contributor Author

@kelson42
Copy link
Contributor

@veloman-yunkan Can you please complete the review of the PR and fix kiwix-build if necessary? This PR has been stalled for too long

src/search.cpp Outdated
Comment on lines 83 to 118
m_valuesmap = read_valuesmap(database.get_metadata("valuesmap"));
auto language = database.get_metadata("language");
if (language.empty() ) {
// Database created before 2017/03 has no language metadata.
// However, term were stemmed anyway and we need to stem our
// search query the same the database was created.
// So we need a language, let's use the one of the zim.
// If zimfile has no language metadata, we can't do lot more here :/
try {
language = archive.getMetadata("Language");
} catch(...) {}
}
if (!language.empty()) {
icu::Locale languageLocale(language.c_str());
/* Configuring language base steemming */
try {
m_stemmer = Xapian::Stem(languageLocale.getLanguage());
m_queryParser.set_stemmer(m_stemmer);
m_queryParser.set_stemming_strategy(Xapian::QueryParser::STEM_ALL);
} catch (...) {
std::cout << "No stemming for language '" << languageLocale.getLanguage() << "'" << std::endl;
}
}
auto stopwords = database.get_metadata("stopwords");
if ( !stopwords.empty() ){
std::string stopWord;
std::istringstream file(stopwords);
Xapian::SimpleStopper* stopper = new Xapian::SimpleStopper();
while (std::getline(file, stopWord, '\n')) {
stopper->add(stopWord);
}
stopper->release();
m_queryParser.set_stopper(stopper);
}
} else {
std::map<std::string, int> valuesmap = read_valuesmap(database.get_metadata("valuesmap"));
if (m_valuesmap != valuesmap ) {
// [TODO] Ignore the database, raise a error ?
}
std::string defaultLanguage;
// Database created before 2017/03 has no language metadata.
// However, term were stemmed anyway and we need to stem our
// search query the same the database was created.
// So we need a language, let's use the one of the zim.
// If zimfile has no language metadata, we can't do lot more here :/
try {
defaultLanguage = archive.getMetadata("Language");
} catch(...) {}

m_metadata = XapianDbMetadata(database, defaultLanguage);
m_queryParser.set_stemmer(m_metadata.m_stemmer);
m_queryParser.set_stemming_strategy(Xapian::QueryParser::STEM_ALL);
m_queryParser.set_stopper(m_metadata.new_stopper());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The commit containing this change (titled "Introduce XapianDbMetadata") is too big. I think it can be broken down to a series of small refactoring changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure, how you would split this commit ?
It seems to me that breaking the commit would be some pretty artificial commits for the reviewer.

Copy link
Collaborator

@veloman-yunkan veloman-yunkan Apr 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just as an example, you could extract a piece of code of the InternalDataBase constructor into a create_stopper() helper function (that would later become the XapianDbMetadata::new_stopper() member function with minimal changes). That way it would be easier to focus during review on one change at a time and make sure that the final result is simply a reorganization of the source code without any change in functionality (or if there is some change of functionality it would be easier to spot/understand).

src/fileimpl.cpp Outdated
Comment on lines 895 to 882
auto langDirent = getDirent(r.second);
while (langDirent->isRedirect()) {
langDirent = getDirent(langDirent->getRedirectIndex());
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is a metadata item allowed to be a redirect?

In any case, this reimplementation of zim::Archive::getMetadata() looks ugly here. Maybe you should pass defaultLanguage as a parameter of FileImpl::getXapianDb()? I see in the next commit that it won't work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is a metadata item allowed to be a redirect?

I remember that we agree that Metadata must not be redirect. But I don't find any reference to this decision (instead that libzim itself cannot create redirect metadata).

But it's better to be tolerant here, especially as the test is almost free to do.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it's better to be tolerant here, especially as the test is almost free to do.

If we can't be certain that a metadata entry is not a redirect, then we should also not ignore the possibility that it can redirect to an infinite loop of redirects.

I also propose that this duplication of metadata access code is extracted into a helper function in an unnamed namespace. It will make the code more readable and will make it easier to reuse should you have to access other metadata in the early stage of opening a ZIM archive.

@veloman-yunkan
Copy link
Collaborator

@veloman-yunkan Can you please complete the review of the PR and fix kiwix-build if necessary? This PR has been stalled for too long

@kelson42 Review done. I don't understand what I should do about kiwix-build concerning this PR.

@kelson42
Copy link
Contributor

@veloman-yunkan Can you please complete the review of the PR and fix kiwix-build if necessary? This PR has been stalled for too long

@kelson42 Review done. I don't understand what I should do about kiwix-build concerning this PR.

The two CI for Linux don't pass! Why so?

src/search.cpp Outdated
m_archives.push_back(archive);
}

mp_mutexes = MultiMutex::create(mutexes.begin(), mutexes.end());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the current implementation of MultiMutex and its usage doesn't protect against deadlocks. To the best of my understanding the defense against deadlocks is to always obtain locks on multiple mutexes in the same order and std::lock() should do exactly that. However you don't call std::lock() on the full set of mutexes at once and try to work around that via MultiMutex.

Now suppose that there are three ZIM archives A, B & C and std::lock()'s algorithm would lock them in that order if it had all of them at once. Then consider the following zim::Searcher objects:

  • zim::Searcher({A, B}): the underlying MultiMutex is MultiMutex(A, MMB) where MMB = MultiMutex(B)
  • zim::Searcher({B, C}): the underlying MultiMutex is MultiMutex(B, MMC) where MMC = MultiMutex(C)
  • zim::Searcher({C, B}): the underlying MultiMutex is MultiMutex(C, MMA) where MMA = MultiMutex(A)

It's not clear how MMA, MMB and MMC are ordered with respect to A, B and C from std::lock()'s point of view. Let assume that MM*s fall behind A, B and C. Then, if three concurrent searches are initiated, you may end up with the following concurrent calls to std::lock():

  • std::lock(A, MMB)
  • std::lock(B, MMC)
  • std::lock(C, MMA)

As a result, all of A, B and C may get locked before proceeding to (attempt to) lock them again via the treacherous proxy objects MMA, MMB and MMC.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the point of std::lock is that it «Locks the given Lockable objects lock1, lock2, ..., lockn using a deadlock avoidance algorithm to avoid deadlock.» (https://en.cppreference.com/w/cpp/thread/lock)

So in you example, if std::lock(A, MMB) succeed to lock A but cannot lock MMB, I trust the stdlib to unlock A to avoid deadlock. I don't know the exact algorithm, but avoid deadlock is part of the contract of std::lock.

It would be a pity that it doesn't avoid deadlock as threadsafe is one (if not the only one) of nice improvements brought by this PR.

Copy link
Collaborator

@veloman-yunkan veloman-yunkan Apr 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So in you example, if std::lock(A, MMB) succeed to lock A but cannot lock MMB, I trust the stdlib to unlock A to avoid deadlock. I don't know the exact algorithm, but avoid deadlock is part of the contract of std::lock.

Indeed, by looking at the code it works like that. Then it may result in livelock.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have sorted the mutexes before locking them. I think this should be enough to avoid the livelock.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With sorted mutexes does such an implementation of MultiMutex have any advantage compared to a simpler one where mutexes are sorted inside the MultiMutex constructor, stored in a vector and locked/unlocked in a loop? A known disadvantage of some implementations of std::lock() is the busy loop when trying to obtain all the locks.

Copy link
Contributor

@kelson42 kelson42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The deb package does not build anymore on ubuntu 24.04 because a problem in memcopy https://github.com/openzim/libzim/actions/runs/14224960852/job/39862160011?pr=958#step:12:459

Copy link

codecov bot commented Apr 16, 2025

Codecov Report

Attention: Patch coverage is 49.56522% with 116 lines in your changes missing coverage. Please review.

Project coverage is 57.53%. Comparing base (55e7458) to head (f971c71).
Report is 68 commits behind head on main.

Files with missing lines Patch % Lines
src/fileimpl.cpp 48.48% 15 Missing and 36 partials ⚠️
src/search.cpp 36.00% 0 Missing and 32 partials ⚠️
src/archive.cpp 48.14% 6 Missing and 8 partials ⚠️
src/search_iterator.cpp 12.50% 0 Missing and 7 partials ⚠️
src/tools.cpp 12.50% 3 Missing and 4 partials ⚠️
src/item.cpp 50.00% 1 Missing and 1 partial ⚠️
src/search_internal.h 71.42% 0 Missing and 2 partials ⚠️
src/lock.h 92.85% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #958      +/-   ##
==========================================
- Coverage   57.57%   57.53%   -0.05%     
==========================================
  Files          98       99       +1     
  Lines        4625     4724      +99     
  Branches     1925     1976      +51     
==========================================
+ Hits         2663     2718      +55     
- Misses        695      704       +9     
- Partials     1267     1302      +35     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@kelson42
Copy link
Contributor

Great, we should merge this once 9.3.0 released

Copy link
Collaborator

@veloman-yunkan veloman-yunkan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Not giving formal approval so that this is not accidentally merged before 9.3.0 is released.

Ideally this helper should be in a anonymous namespace but we will move
soon in next commits.
This way we ensure ouselves that we will not mess up with `Xapian::Query`.
This struct will help us to store Xapian's metadata, independantly from
the Searcher.
We don't need it. No need to store it.
Using a structure provides better meaning that a simple pair.
Note this commit introduce a race condition.

As we now open only one xapian db per archive, we may have different
searchers using the same xapian db in the same time.
We will need to lock all db mutex in the same time when doing multizim
search. And we must avoid dead locks in the process.

Std lib provides us a `std::lock()` [1] function to lock several mutex
in the same time. However, it is a variatic function and it doesn't take
a list of mutex to lock.

So we havet to implemented ourselves. But we don't want to re implement
an error prone algorithm. To do so, we introduce a `MultiMutex` (a
linked list) and we (ab)use the fact that `std::lock()` is a template
taking any object "Lockable" [2]

[1] https://en.cppreference.com/w/cpp/thread/lock
[2] https://en.cppreference.com/w/cpp/named_req/Lockable
We finally lock all xapian Db mutexes in the same time before when
access them.
Now m_direntLookup is created at fileimpl creation, not at first (user)
access to the dirents.

This is kind of useless to delay the dirent_lookup creation until the
first access as user will need to create it to use the archive.

For the rare case when this is not needed (search only, iter efficient),
user can now simply deactivate the dirent lookup.
On Windows, static member are not exported in dlls.
So we don't have a value to use as default for openConfig.
@kelson42
Copy link
Contributor

Now that we have released 9.3.0, I will rebase and merge

@kelson42 kelson42 changed the title Preload xapian database Preload Xapian database Apr 17, 2025
Copy link
Contributor

@kelson42 kelson42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kelson42 kelson42 merged commit 0787eb0 into main Apr 18, 2025
32 of 34 checks passed
@kelson42 kelson42 deleted the xapian_preload branch April 18, 2025 04:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Speed-up Xapian searches by preloading indexes

4 participants