Skip to content

Conversation

@ThisIsFineTM
Copy link
Contributor

This introduces a case-insensitive lexicographical comparator for the mimetype map in zimwriterfs/tools. This allows us to remove the duplicate entries in the mime type map.

Closes #457.

@codecov
Copy link

codecov bot commented Mar 11, 2025

Codecov Report

❌ Patch coverage is 88.23529% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 27.48%. Comparing base (151ad83) to head (23d8a17).

Files with missing lines Patch % Lines
src/zimwriterfs/tools.cpp 88.23% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #458      +/-   ##
==========================================
+ Coverage   26.34%   27.48%   +1.14%     
==========================================
  Files          26       26              
  Lines        2460     2401      -59     
  Branches     1381     1314      -67     
==========================================
+ Hits          648      660      +12     
  Misses       1300     1300              
+ Partials      512      441      -71     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment on lines 71 to 69
return MimeMap_t{
{"html", "text/html"},
{"htm", "text/html"},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By taking advantage of c++11 style initialization, the _create_extMimeTypes() function is rendered meaningless - you can initialize extMimeTypes directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll move that initialization to be direct and force push over the previous commit.

}

static std::map<std::string, std::string> extMimeTypes = _create_extMimeTypes();
static MimeMap_t extMimeTypes = _create_extMimeTypes();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make this const.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do. I'll add that in a second commit because that has to be propagated. We can always squash it later.

Comment on lines 33 to 65
// adapted from cppreference example implementation of lexicographical_compare
struct tolower_str_comparator {
using UChar_t = unsigned char;

static constexpr UChar_t toLower(const UChar_t c) noexcept {
constexpr UChar_t offset {'a' - 'A'};
return (c <= 'Z' && c >= 'A') ? c + offset : c;
}

static constexpr bool charCompare(const UChar_t lhs, const UChar_t rhs) noexcept {
return toLower(lhs) < toLower(rhs);
}

bool operator() (const std::string& lhs, const std::string& rhs) const noexcept {
auto first1 = lhs.cbegin();
auto last1 = lhs.cend();
auto first2 = rhs.cbegin();
auto last2 = rhs.cend();

for (; (first1 != last1) && (first2 != last2); ++first1, ++first2)
{
if (charCompare(*first1, *first2)) {
return true;
}
if (charCompare(*first2, *first1)) {
return false;
}
}

return (first1 == last1) && (first2 != last2);
}
};

using MimeMap_t = std::map<std::string, const std::string, tolower_str_comparator>;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems too much code for the problem addressed. With the current limited usage of extMimeTypes and some rewrite you can achieve the same effect as follows:

static const std::string& getMimeType(std::string ext) {
    static const std::map<std::string, const std::string> extMimeTypes = {
        {"html", "text/html"},
        ...
    };

    for ( size_t i = 0; i < ext.size(); ++i )
        ext[i] = std::tolower(ext[i]);

    return extMimeTypes.at(ext);
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are more places where the custom comparator provides enhancements, most notably usage of std::map::find. I'd like to move away from relying on exceptions in cases where find can get the job done. I'm just breaking up the work into smaller PRs so the individual changes are less time consuming to review. I could provide helpers to the mimetype map to also work with find, but at that point we're hitting the same amount of lines as the comparator with the added detriment of having to make a copy of the ext string rather than getting it in-place for each query.

Copy link
Collaborator

@veloman-yunkan veloman-yunkan Mar 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with custom comparator is that it results in a new instantiation of the std::map template, bloating the binary code.

I'd like to move away from relying on exceptions in cases where find can get the job done.

I too frown upon such usage of exceptions, and in this case it can be worked around by making getMimeType() return an empty string (which will have to be checked by the caller).

with the added detriment of having to make a copy of the ... ext string rather than getting it in-place for each query.

I believe you know the adage about premature optimization. The problem with it is that when tiny optimizations are achieved at the cost of added complexity the final result may be worse performance compared to a simpler solution (or a relatively easy modification of it that addresses bigger performance problems).

Copy link
Contributor Author

@ThisIsFineTM ThisIsFineTM Mar 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with custom comparator is that it results in a new instantiation of the std::map template, bloating the binary code.

I'm curious as to how much that is actually an issue, so here's some data:
Comparing Release builds of zimwriterfs on GCC 14.2 (in bytes):

branch settings zimwriterfs size(bytes) %difference vs main
main release 364848 0
444-mimetype-own-unit release 361000 -1.06%
457-case-insensitive-comparator nonconst-map release 371720 +1.87%
457-case-insensitive-comparator + const map release 371256 +1.74%

So the main difference here is that in the 444 branch, the exception code has been removed in favor of find, whereas in the 457 branch just has the comparator swap-in. Despite concerns about the extra instantiation introducing ~10KB to the executable on its own, the removal of all the related stack unwinding code with the exceptions leads to a smaller binary size with the added ergonomics of not having to not have manually call to-lower at every map interaction.

I too frown upon such usage of exceptions, and in this case it can be worked around by making getMimeType() return an empty string (which will have to be checked by the caller).

If we really want to go that route, its more appropriate to return a std::optional.

I believe you know the adage about premature optimization. The problem with it is that when tiny optimizations are achieved at the cost of added complexity the final result may be worse performance compared to a simpler solution (or a relatively easy modification of it that addresses bigger performance problems).

Is it really added complexity though? The ergonomics of using the custom comparator are identical to when there were just duplicate entries in the map. The alternative of having the user have to check each map query for an empty string is more complexity in my opinion because it is a special case that requires knowledge of the source code rather than just relying on the behavioral guarantees of the standard library (Re: Principle of Least Surprise).

Also Re: premature optimization / performance, it looks like my implementation is a slight performance improvement, or at worst at parity.

main$ ninja -C build/ test
ninja: Entering directory `build/'
[0/1] Running all tests.
1/4 metadata-test                   OK              0.01s
2/4 tools-test                      OK              0.01s
3/4 zimcheck-test                   OK              0.10s
4/4 zimwriterfs-zimcreatorfs        OK              0.20s
444-mime-types-own-unit$ ninja -C ./build/ test
ninja: Entering directory `./build/'
[0/1] Running all tests.
1/4 metadata-test                   OK              0.01s
2/4 tools-test                      OK              0.00s
3/4 zimcheck-test                   OK              0.10s
4/4 zimwriterfs-zimcreatorfs        OK              0.19s
457-case-insensitive-comparator$ ninja -C ./build/ test
ninja: Entering directory `./build/'
[0/1] Running all tests.
1/4 metadata-test                   OK              0.01s
2/4 tools-test                      OK              0.01s
3/4 zimcheck-test                   OK              0.09s
4/4 zimwriterfs-zimcreatorfs        OK              0.19s

Copy link
Contributor Author

@ThisIsFineTM ThisIsFineTM Mar 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a side note, if binary bloat is more of a concern, enabling LTO can cut the tool binary sizes by roughly 10-50% depending on which compiler is used (tested with gcc 14.2 and clang 19.1). Just need to put in a 1 line fix into the metadata test file to make the linker happy. I can do that in another PR after we resolve this one. Most of the space savings is just from LTO. Adding -ffunction-sections -fdata-sections --gc-sections on top of LTO was of negligible benefit.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ThisIsFineTM I appreciate all the effort that you put into defending your solution. My point of view is that one should stick to the KISS principle unless there are good reasons to deviate from it. My proposal to encapsulate extMimeMap behind a single getMimeType() API (with getMimeType(ext).empty() serving as a test for ext not being a known extension) is along those lines.

However I don't mind if your solution is merged upon someone else's approval - IMHO the issue doesn't deserve that much discussion.

Copy link
Contributor Author

@ThisIsFineTM ThisIsFineTM Mar 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@veloman-yunkan Yeah, we are bikeshedding a bit. Although the find with how much LTO can reduce the tool size was a nice perk. I'll open up a separate issue for that and put in a PR. I'll also clean up what I have here so when we get another pair of eyes on it its ready to go.

@kelson42 @mgautierfr I know you two are busy, but if one of you could take a glance at this at some point and give a cursory 👍 or 👎 that would be appreciated.

@ThisIsFineTM ThisIsFineTM force-pushed the 457-case-insensitive-comparator branch from 1365faa to 2fd2a9e Compare March 18, 2025 20:50
@kelson42
Copy link
Contributor

@veloman-yunkan I believe we should do a new review pass

@veloman-yunkan
Copy link
Collaborator

@veloman-yunkan I believe we should do a new review pass

@kelson42 Nothing has changed since the last review (weather & geopolitical events are irrelevant to this PR).

@kelson42 kelson42 force-pushed the 457-case-insensitive-comparator branch from 2fd2a9e to 23d8a17 Compare November 5, 2025 03:40
@kelson42
Copy link
Contributor

kelson42 commented Nov 5, 2025

@ThisIsFineTM This is a quite old PR and we would really like to see itbcompleted to merge it. Any chance you coukd complete it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Case-insensitive comparator for mimetype map

3 participants