-
-
Notifications
You must be signed in to change notification settings - Fork 62
Preload Xapian database #958
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
99506fd
to
da0ad2f
Compare
You did not mentioned it, but did you ran benchmarking ( I think it would be important to make same measurements on a Raspberry Pi 3 (or even a Raspberry Pi Zero W) with the ZIM on an SD card. I imagine numbers will be significantly different, and the benefit of preloading Xapian database could be more visible since this is kinda the "worst case scenario" (at least it is the scenario of original issue where we had bad numbers in Cold vs Hot cases). Maybe testing on a real Android device (e.g. low-end tablet) would be fun as well (or not ^^ I imagine you don't have sufficient tooling for this just like I do). |
cea50cf
to
d2e4ae7
Compare
Indeed. I have a pi3, I will run the same test on it. And I confirm I have no android device, I can send you the binaries if you have one and want to test. |
I have no idea about how to run a binary on an Android device. Is it just a matter of starting an adb shell to the device and starting the binary? If so (or something straightforward like this), then yes I can happily help with this. |
Here the numbers on a Pi3:
While the numbers are obviously greater, the same schema appears. Opening the xapian db is about 131ms, searching (xapian only) is about 850ms. So we spend around 1s in xapian and 9s in zim itself. |
Yes, it should be as simple as it sound. |
Please send me the binary and the ZIM file then, I will give it a try (mostly for curiosity, since numbers on Pi3 seems to speak for themselves) |
I cannot because of this : kiwix/kiwix-build#808 |
@veloman-yunkan Can you please do the necessary to fix kiwix-build? |
@kelson42 I will |
@mgautierfr Does the documentation (readthedocs) has been changed? It is not very clear to me right now how to use this feature and what has changed in term of interface for the developer. |
It will be once merged. readthedocs documentation is generated from comments in source code. You can see the rtd for the pr at : https://libzim--958.org.readthedocs.build/en/958/ Saying that, I've just realize that the doc is generated as if libzim was not compiled with xapian. So some methods are not present (especially
User have to call |
@mgautierfr From what I see there, the preloading of the Xapian indexes is not and option to specify at ZIM archive. |
Yes, because of this :
|
|
@mgautierfr preloading xapian index has to be an option of the opening primitives, not a dedicated public API. |
d2e4ae7
to
a393cae
Compare
72276cf
to
49e9136
Compare
@veloman-yunkan Can you please complete the review of the PR and fix kiwix-build if necessary? This PR has been stalled for too long |
src/search.cpp
Outdated
m_valuesmap = read_valuesmap(database.get_metadata("valuesmap")); | ||
auto language = database.get_metadata("language"); | ||
if (language.empty() ) { | ||
// Database created before 2017/03 has no language metadata. | ||
// However, term were stemmed anyway and we need to stem our | ||
// search query the same the database was created. | ||
// So we need a language, let's use the one of the zim. | ||
// If zimfile has no language metadata, we can't do lot more here :/ | ||
try { | ||
language = archive.getMetadata("Language"); | ||
} catch(...) {} | ||
} | ||
if (!language.empty()) { | ||
icu::Locale languageLocale(language.c_str()); | ||
/* Configuring language base steemming */ | ||
try { | ||
m_stemmer = Xapian::Stem(languageLocale.getLanguage()); | ||
m_queryParser.set_stemmer(m_stemmer); | ||
m_queryParser.set_stemming_strategy(Xapian::QueryParser::STEM_ALL); | ||
} catch (...) { | ||
std::cout << "No stemming for language '" << languageLocale.getLanguage() << "'" << std::endl; | ||
} | ||
} | ||
auto stopwords = database.get_metadata("stopwords"); | ||
if ( !stopwords.empty() ){ | ||
std::string stopWord; | ||
std::istringstream file(stopwords); | ||
Xapian::SimpleStopper* stopper = new Xapian::SimpleStopper(); | ||
while (std::getline(file, stopWord, '\n')) { | ||
stopper->add(stopWord); | ||
} | ||
stopper->release(); | ||
m_queryParser.set_stopper(stopper); | ||
} | ||
} else { | ||
std::map<std::string, int> valuesmap = read_valuesmap(database.get_metadata("valuesmap")); | ||
if (m_valuesmap != valuesmap ) { | ||
// [TODO] Ignore the database, raise a error ? | ||
} | ||
std::string defaultLanguage; | ||
// Database created before 2017/03 has no language metadata. | ||
// However, term were stemmed anyway and we need to stem our | ||
// search query the same the database was created. | ||
// So we need a language, let's use the one of the zim. | ||
// If zimfile has no language metadata, we can't do lot more here :/ | ||
try { | ||
defaultLanguage = archive.getMetadata("Language"); | ||
} catch(...) {} | ||
|
||
m_metadata = XapianDbMetadata(database, defaultLanguage); | ||
m_queryParser.set_stemmer(m_metadata.m_stemmer); | ||
m_queryParser.set_stemming_strategy(Xapian::QueryParser::STEM_ALL); | ||
m_queryParser.set_stopper(m_metadata.new_stopper()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The commit containing this change (titled "Introduce XapianDbMetadata") is too big. I think it can be broken down to a series of small refactoring changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure, how you would split this commit ?
It seems to me that breaking the commit would be some pretty artificial commits for the reviewer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just as an example, you could extract a piece of code of the InternalDataBase
constructor into a create_stopper()
helper function (that would later become the XapianDbMetadata::new_stopper()
member function with minimal changes). That way it would be easier to focus during review on one change at a time and make sure that the final result is simply a reorganization of the source code without any change in functionality (or if there is some change of functionality it would be easier to spot/understand).
src/fileimpl.cpp
Outdated
auto langDirent = getDirent(r.second); | ||
while (langDirent->isRedirect()) { | ||
langDirent = getDirent(langDirent->getRedirectIndex()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is a metadata item allowed to be a redirect?
In any case, this reimplementation of zim::Archive::getMetadata()
looks ugly here. Maybe you should pass I see in the next commit that it won't work.defaultLanguage
as a parameter of FileImpl::getXapianDb()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is a metadata item allowed to be a redirect?
I remember that we agree that Metadata must not be redirect. But I don't find any reference to this decision (instead that libzim itself cannot create redirect metadata).
But it's better to be tolerant here, especially as the test is almost free to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But it's better to be tolerant here, especially as the test is almost free to do.
If we can't be certain that a metadata entry is not a redirect, then we should also not ignore the possibility that it can redirect to an infinite loop of redirects.
I also propose that this duplication of metadata access code is extracted into a helper function in an unnamed namespace. It will make the code more readable and will make it easier to reuse should you have to access other metadata in the early stage of opening a ZIM archive.
@kelson42 Review done. I don't understand what I should do about |
The two CI for Linux don't pass! Why so? |
src/search.cpp
Outdated
m_archives.push_back(archive); | ||
} | ||
|
||
mp_mutexes = MultiMutex::create(mutexes.begin(), mutexes.end()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the current implementation of MultiMutex
and its usage doesn't protect against deadlocks. To the best of my understanding the defense against deadlocks is to always obtain locks on multiple mutexes in the same order and std::lock()
should do exactly that. However you don't call std::lock()
on the full set of mutexes at once and try to work around that via MultiMutex
.
Now suppose that there are three ZIM archives A
, B
& C
and std::lock()
's algorithm would lock them in that order if it had all of them at once. Then consider the following zim::Searcher
objects:
zim::Searcher({A, B})
: the underlyingMultiMutex
isMultiMutex(A, MMB)
whereMMB = MultiMutex(B)
zim::Searcher({B, C})
: the underlyingMultiMutex
isMultiMutex(B, MMC)
whereMMC = MultiMutex(C)
zim::Searcher({C, B})
: the underlyingMultiMutex
isMultiMutex(C, MMA)
whereMMA = MultiMutex(A)
It's not clear how MMA
, MMB
and MMC
are ordered with respect to A
, B
and C
from std::lock()
's point of view. Let assume that MM*
s fall behind A
, B
and C
. Then, if three concurrent searches are initiated, you may end up with the following concurrent calls to std::lock()
:
std::lock(A, MMB)
std::lock(B, MMC)
std::lock(C, MMA)
As a result, all of A
, B
and C
may get locked before proceeding to (attempt to) lock them again via the treacherous proxy objects MMA
, MMB
and MMC
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the point of std::lock
is that it «Locks the given Lockable objects lock1, lock2, ..., lockn using a deadlock avoidance algorithm to avoid deadlock.» (https://en.cppreference.com/w/cpp/thread/lock)
So in you example, if std::lock(A, MMB)
succeed to lock A
but cannot lock MMB
, I trust the stdlib to unlock A
to avoid deadlock. I don't know the exact algorithm, but avoid deadlock is part of the contract of std::lock
.
It would be a pity that it doesn't avoid deadlock as threadsafe is one (if not the only one) of nice improvements brought by this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So in you example, if
std::lock(A, MMB)
succeed to lockA
but cannot lockMMB
, I trust the stdlib to unlockA
to avoid deadlock. I don't know the exact algorithm, but avoid deadlock is part of the contract ofstd::lock
.
Indeed, by looking at the code it works like that. Then it may result in livelock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have sorted the mutexes before locking them. I think this should be enough to avoid the livelock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With sorted mutexes does such an implementation of MultiMutex
have any advantage compared to a simpler one where mutexes are sorted inside the MultiMutex
constructor, stored in a vector and locked/unlocked in a loop? A known disadvantage of some implementations of std::lock()
is the busy loop when trying to obtain all the locks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The deb package does not build anymore on ubuntu 24.04 because a problem in memcopy https://github.com/openzim/libzim/actions/runs/14224960852/job/39862160011?pr=958#step:12:459
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #958 +/- ##
==========================================
- Coverage 57.57% 57.53% -0.05%
==========================================
Files 98 99 +1
Lines 4625 4724 +99
Branches 1925 1976 +51
==========================================
+ Hits 2663 2718 +55
- Misses 695 704 +9
- Partials 1267 1302 +35 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Great, we should merge this once |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Not giving formal approval so that this is not accidentally merged before 9.3.0 is released.
Ideally this helper should be in a anonymous namespace but we will move soon in next commits.
This way we ensure ouselves that we will not mess up with `Xapian::Query`.
This struct will help us to store Xapian's metadata, independantly from the Searcher.
We don't need it. No need to store it.
Using a structure provides better meaning that a simple pair.
Note this commit introduce a race condition. As we now open only one xapian db per archive, we may have different searchers using the same xapian db in the same time.
We will need to lock all db mutex in the same time when doing multizim search. And we must avoid dead locks in the process. Std lib provides us a `std::lock()` [1] function to lock several mutex in the same time. However, it is a variatic function and it doesn't take a list of mutex to lock. So we havet to implemented ourselves. But we don't want to re implement an error prone algorithm. To do so, we introduce a `MultiMutex` (a linked list) and we (ab)use the fact that `std::lock()` is a template taking any object "Lockable" [2] [1] https://en.cppreference.com/w/cpp/thread/lock [2] https://en.cppreference.com/w/cpp/named_req/Lockable
We finally lock all xapian Db mutexes in the same time before when access them.
Now m_direntLookup is created at fileimpl creation, not at first (user) access to the dirents. This is kind of useless to delay the dirent_lookup creation until the first access as user will need to create it to use the archive. For the rare case when this is not needed (search only, iter efficient), user can now simply deactivate the dirent lookup.
On Windows, static member are not exported in dlls. So we don't have a value to use as default for openConfig.
59d0bf4
to
f971c71
Compare
Now that we have released 9.3.0, I will rebase and merge |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Fix #617 (preloading xapian database)
Numbers do not show a strong win (not a lost neither) but this PR also introduce nice improvement as side effect and so is a good preparatory work for next project about multi zim and multi lang issues.
Globally this PR does:
XapianDb
is about a specific database stored in a zim file.InternalDataBase
is about a database we search in (which "include"XapianDb
s)Archive::preloadXapianDb
to explicitly load the xapian db.The numbers
Globally, what is taking time is loading the data from fs into memory. And this is the purpose of this PR to load this data upfront.
So there is two use case to evaluate: a cold one (data is not loading) and warm one (data is loaded).
On top of that, locating the xapian db or using the search results imply loading entries from zim file to get the title. Accessing zim dirent will activate our dirent lookup cache and we will need to populate it (and we know it can take some time, that was the purpose of #950 to be able to deactivate it)
Numbers are obtained by instrumenting zimsearch and running
zimsearch wikispecies_en_all_maxi_2020-06.zim home
(and taking the mean of 3 runs).Numbers are in milliseconds. Main number is cold. Number is parentheses is warm.
Search (w getTitle)
is the time to do the search and use it. It include the time to get the title (from zim dirents) of all results (getTitle
).Search (wo getTitle)
is simply the difference between both, this is the real time taken by xapian to do the search.From
Preloading
table, we can see that loading the xapian DB itself is pretty fast, only 2ms.When dirent lookup is activated, it take 155ms but this is mostly the time to populate the dirent lookup cache. We found those numbers in
Create search
(no preload) as we load the xapian db (and populate the dirent lookup cache) when we start the search.The search performance itself is not really changed. Warm run is a bit faster but not a game changer.
Exploiting the search result (
getTitle
) also shows mitigated results:For the
Total
, preloading and dirent lookup cache also shows strange results (preloading should not impact the total, dirent lookup cache should have the same impact, what ever preloading is)At the end, it seem that most of the time spend during a search is about loading the dirent from zim file (either at dirent lookup population or at get title), not from xapian.
As said in the introduction, this PR is still interresting for its side effects: