-
-
Notifications
You must be signed in to change notification settings - Fork 43
Case-insensitive comparator for extMimeTypes. #458
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Case-insensitive comparator for extMimeTypes. #458
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #458 +/- ##
==========================================
+ Coverage 26.34% 27.48% +1.14%
==========================================
Files 26 26
Lines 2460 2401 -59
Branches 1381 1314 -67
==========================================
+ Hits 648 660 +12
Misses 1300 1300
+ Partials 512 441 -71 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
src/zimwriterfs/tools.cpp
Outdated
| return MimeMap_t{ | ||
| {"html", "text/html"}, | ||
| {"htm", "text/html"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By taking advantage of c++11 style initialization, the _create_extMimeTypes() function is rendered meaningless - you can initialize extMimeTypes directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll move that initialization to be direct and force push over the previous commit.
src/zimwriterfs/tools.cpp
Outdated
| } | ||
|
|
||
| static std::map<std::string, std::string> extMimeTypes = _create_extMimeTypes(); | ||
| static MimeMap_t extMimeTypes = _create_extMimeTypes(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make this const.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do. I'll add that in a second commit because that has to be propagated. We can always squash it later.
| // adapted from cppreference example implementation of lexicographical_compare | ||
| struct tolower_str_comparator { | ||
| using UChar_t = unsigned char; | ||
|
|
||
| static constexpr UChar_t toLower(const UChar_t c) noexcept { | ||
| constexpr UChar_t offset {'a' - 'A'}; | ||
| return (c <= 'Z' && c >= 'A') ? c + offset : c; | ||
| } | ||
|
|
||
| static constexpr bool charCompare(const UChar_t lhs, const UChar_t rhs) noexcept { | ||
| return toLower(lhs) < toLower(rhs); | ||
| } | ||
|
|
||
| bool operator() (const std::string& lhs, const std::string& rhs) const noexcept { | ||
| auto first1 = lhs.cbegin(); | ||
| auto last1 = lhs.cend(); | ||
| auto first2 = rhs.cbegin(); | ||
| auto last2 = rhs.cend(); | ||
|
|
||
| for (; (first1 != last1) && (first2 != last2); ++first1, ++first2) | ||
| { | ||
| if (charCompare(*first1, *first2)) { | ||
| return true; | ||
| } | ||
| if (charCompare(*first2, *first1)) { | ||
| return false; | ||
| } | ||
| } | ||
|
|
||
| return (first1 == last1) && (first2 != last2); | ||
| } | ||
| }; | ||
|
|
||
| using MimeMap_t = std::map<std::string, const std::string, tolower_str_comparator>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems too much code for the problem addressed. With the current limited usage of extMimeTypes and some rewrite you can achieve the same effect as follows:
static const std::string& getMimeType(std::string ext) {
static const std::map<std::string, const std::string> extMimeTypes = {
{"html", "text/html"},
...
};
for ( size_t i = 0; i < ext.size(); ++i )
ext[i] = std::tolower(ext[i]);
return extMimeTypes.at(ext);
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are more places where the custom comparator provides enhancements, most notably usage of std::map::find. I'd like to move away from relying on exceptions in cases where find can get the job done. I'm just breaking up the work into smaller PRs so the individual changes are less time consuming to review. I could provide helpers to the mimetype map to also work with find, but at that point we're hitting the same amount of lines as the comparator with the added detriment of having to make a copy of the ext string rather than getting it in-place for each query.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem with custom comparator is that it results in a new instantiation of the std::map template, bloating the binary code.
I'd like to move away from relying on exceptions in cases where
findcan get the job done.
I too frown upon such usage of exceptions, and in this case it can be worked around by making getMimeType() return an empty string (which will have to be checked by the caller).
with the added detriment of having to make a copy of the ...
extstring rather than getting it in-place for each query.
I believe you know the adage about premature optimization. The problem with it is that when tiny optimizations are achieved at the cost of added complexity the final result may be worse performance compared to a simpler solution (or a relatively easy modification of it that addresses bigger performance problems).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem with custom comparator is that it results in a new instantiation of the std::map template, bloating the binary code.
I'm curious as to how much that is actually an issue, so here's some data:
Comparing Release builds of zimwriterfs on GCC 14.2 (in bytes):
| branch | settings | zimwriterfs size(bytes) | %difference vs main |
|---|---|---|---|
| main | release | 364848 | 0 |
| 444-mimetype-own-unit | release | 361000 | -1.06% |
| 457-case-insensitive-comparator nonconst-map | release | 371720 | +1.87% |
| 457-case-insensitive-comparator + const map | release | 371256 | +1.74% |
So the main difference here is that in the 444 branch, the exception code has been removed in favor of find, whereas in the 457 branch just has the comparator swap-in. Despite concerns about the extra instantiation introducing ~10KB to the executable on its own, the removal of all the related stack unwinding code with the exceptions leads to a smaller binary size with the added ergonomics of not having to not have manually call to-lower at every map interaction.
I too frown upon such usage of exceptions, and in this case it can be worked around by making getMimeType() return an empty string (which will have to be checked by the caller).
If we really want to go that route, its more appropriate to return a std::optional.
I believe you know the adage about premature optimization. The problem with it is that when tiny optimizations are achieved at the cost of added complexity the final result may be worse performance compared to a simpler solution (or a relatively easy modification of it that addresses bigger performance problems).
Is it really added complexity though? The ergonomics of using the custom comparator are identical to when there were just duplicate entries in the map. The alternative of having the user have to check each map query for an empty string is more complexity in my opinion because it is a special case that requires knowledge of the source code rather than just relying on the behavioral guarantees of the standard library (Re: Principle of Least Surprise).
Also Re: premature optimization / performance, it looks like my implementation is a slight performance improvement, or at worst at parity.
main$ ninja -C build/ test
ninja: Entering directory `build/'
[0/1] Running all tests.
1/4 metadata-test OK 0.01s
2/4 tools-test OK 0.01s
3/4 zimcheck-test OK 0.10s
4/4 zimwriterfs-zimcreatorfs OK 0.20s444-mime-types-own-unit$ ninja -C ./build/ test
ninja: Entering directory `./build/'
[0/1] Running all tests.
1/4 metadata-test OK 0.01s
2/4 tools-test OK 0.00s
3/4 zimcheck-test OK 0.10s
4/4 zimwriterfs-zimcreatorfs OK 0.19s457-case-insensitive-comparator$ ninja -C ./build/ test
ninja: Entering directory `./build/'
[0/1] Running all tests.
1/4 metadata-test OK 0.01s
2/4 tools-test OK 0.01s
3/4 zimcheck-test OK 0.09s
4/4 zimwriterfs-zimcreatorfs OK 0.19sThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a side note, if binary bloat is more of a concern, enabling LTO can cut the tool binary sizes by roughly 10-50% depending on which compiler is used (tested with gcc 14.2 and clang 19.1). Just need to put in a 1 line fix into the metadata test file to make the linker happy. I can do that in another PR after we resolve this one. Most of the space savings is just from LTO. Adding -ffunction-sections -fdata-sections --gc-sections on top of LTO was of negligible benefit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ThisIsFineTM I appreciate all the effort that you put into defending your solution. My point of view is that one should stick to the KISS principle unless there are good reasons to deviate from it. My proposal to encapsulate extMimeMap behind a single getMimeType() API (with getMimeType(ext).empty() serving as a test for ext not being a known extension) is along those lines.
However I don't mind if your solution is merged upon someone else's approval - IMHO the issue doesn't deserve that much discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@veloman-yunkan Yeah, we are bikeshedding a bit. Although the find with how much LTO can reduce the tool size was a nice perk. I'll open up a separate issue for that and put in a PR. I'll also clean up what I have here so when we get another pair of eyes on it its ready to go.
@kelson42 @mgautierfr I know you two are busy, but if one of you could take a glance at this at some point and give a cursory 👍 or 👎 that would be appreciated.
1365faa to
2fd2a9e
Compare
|
@veloman-yunkan I believe we should do a new review pass |
@kelson42 Nothing has changed since the last review (weather & geopolitical events are irrelevant to this PR). |
2fd2a9e to
23d8a17
Compare
|
@ThisIsFineTM This is a quite old PR and we would really like to see itbcompleted to merge it. Any chance you coukd complete it? |
This introduces a case-insensitive lexicographical comparator for the mimetype map in zimwriterfs/tools. This allows us to remove the duplicate entries in the mime type map.
Closes #457.