@Aevil1 Aevil1 commented Jun 30, 2025

Fixes #1000

This patch strips MIME parameters (e.g., charset, profile) to normalize MIME types, removes duplicates, and filters out malformed or incomplete entries in the Counter metadata (e.g., entries lacking "=count" or with an invalid type/subtype format).
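For illustration only, the normalization described above could be sketched as follows. This is a Python sketch, not the C++ code from the patch; the function names `normalize_mime_type` and `normalize_mime_list` are hypothetical, and lowercasing the result is an assumption based on MIME types being case-insensitive.

```python
import re

# Characters allowed in a MIME type or subtype name (an approximation of the
# RFC 6838 "restricted name" rule; assumed here, not taken from the patch).
_TOKEN = r"[A-Za-z0-9!#$&^_.+-]+"
_TYPE_SUBTYPE = re.compile(rf"{_TOKEN}/{_TOKEN}")


def normalize_mime_type(raw):
    """Return the bare type/subtype, or None if the value is malformed."""
    # Everything after the first ';' is a parameter (charset, profile, ...).
    base = raw.split(";", 1)[0].strip().lower()
    return base if _TYPE_SUBTYPE.fullmatch(base) else None


def normalize_mime_list(raw_types):
    """Strip parameters, drop malformed values, and remove duplicates."""
    result = []
    for raw in raw_types:
        base = normalize_mime_type(raw)
        if base is not None and base not in result:
            result.append(base)  # keeps first-seen order
    return result


if __name__ == "__main__":
    print(normalize_mime_list(
        ["text/html;charset=utf-8", "text/html", "image/png", "bad"]
    ))
```

Applied at ZIM creation time, a step like this would keep parameterized variants of the same type from being recorded as separate MIME types.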

@kelson42 kelson42 requested a review from veloman-yunkan July 1, 2025 04:03
@veloman-yunkan
Collaborator

@kelson42 I think that before fixing the problem with Counter metadata we must define how we want it to be fixed.

@kelson42
Contributor

kelson42 commented Jul 8, 2025

> @kelson42 I think that before fixing the problem with Counter metadata we must define how we want it to be fixed.

Seems straightforward to me; what is unclear?

@veloman-yunkan
Collaborator

@kelson42 I have no problem with the part of the solution that deals with stripping of the MIME-type parameters during ZIM creation. But what should we do with unstripped MIME-types recorded in the MIME-type list and Counter metadata in existing ZIM-files? Maybe we should just acknowledge such ZIM-files as buggy and refrain from healing them on-the-fly by newer versions of libzim as attempted in this PR? BTW, this PR addresses the Counter metadata only and in a way that results in dropping (rather than correcting) those MIME-types that have been entered into the Counter metadata with parameters.

@kelson42
Contributor

> @kelson42 I have no problem with the part of the solution that deals with stripping of the MIME-type parameters during ZIM creation. But what should we do with unstripped MIME-types recorded in the MIME-type list and Counter metadata in existing ZIM-files? Maybe we should just acknowledge such ZIM-files as buggy and refrain from healing them on-the-fly by newer versions of libzim as attempted in this PR?

We need to be a bit flexible here. I'm not even sure we should consider MIME-type parameters as wrong "from a ZIM perspective".

> BTW, this PR addresses the Counter metadata only and in a way that results in dropping (rather than correcting) those MIME-types that have been entered into the Counter metadata with parameters.

We should not ignore or drop them; we should count them all together (based on the MIME type). @Aevil1 Can you fix that please?
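A rough sketch of what "counting them all together" could look like, in Python rather than libzim's C++. It assumes the Counter metadata is a string of `mimetype=count` entries separated by `;`, so a parameter such as `;charset=utf-8` collides with the entry separator. The function `merge_counter` and the segment-joining heuristic are hypothetical and not taken from this PR.

```python
import re
from collections import Counter

_TOKEN = r"[A-Za-z0-9!#$&^_.+-]+"
_TYPE_SUBTYPE = re.compile(rf"{_TOKEN}/{_TOKEN}")


def merge_counter(raw):
    """Sum the counts of all entries that share the same bare type/subtype."""
    entries = []
    for segment in raw.split(";"):
        if not segment.strip():
            continue  # tolerate empty segments such as a trailing ';'
        head = segment.split("=", 1)[0].strip()
        if _TYPE_SUBTYPE.fullmatch(head):
            entries.append(segment)          # looks like the start of an entry
        elif entries:
            entries[-1] += ";" + segment     # parameter of the previous entry
        # A parameter-like segment with no preceding entry is ignored.

    totals = Counter()
    for entry in entries:
        mime_part, sep, count = entry.rpartition("=")
        count = count.strip()
        if not sep or not (count.isascii() and count.isdigit()):
            continue  # no usable count, nothing to add
        base = mime_part.split(";", 1)[0].strip().lower()
        totals[base] += int(count)
    return dict(totals)


if __name__ == "__main__":
    print(merge_counter("text/html;charset=utf-8=3;text/html=2;image/png=10"))
```

The heuristic is inherently ambiguous for some inputs: an entry with no count whose last parameter value happens to be numeric would be read as a count. That ambiguity is one reason to strip parameters at creation time as well.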

@kelson42
Contributor

@Aevil1 Still motivated to complete the PR?

@kelson42
Contributor

kelson42 commented Oct 1, 2025

@veloman-yunkan I guess we will have to close this PR and implement the fix in a new PR :(

Successfully merging this pull request may close these issues.

Wikipedia_en_top_all has 829k entries instead of 50k