A collection of files gathered from different sources to be used for tests that compare mimetype with the UNIX file utility.
TLDR: ~97% of samples identified correctly
Of the 3% of files that do not match, most are genuinely misidentified, but some differ only because mimetype identifies them more precisely than file:
- XML-based file formats, like GML and GPX, are seen as generic text/xml by file
- mimetype identifies subtitles as text/vtt, while file sees them just as text/plain
- mimetype identifies text/tab-separated-values, while file sees just text/plain
- etc.
Results show the latest percentage of misidentified files and a breakdown of the most frequently misidentified formats. If you want to run the tests yourself, use the following:
- testfiles contains all the test files (around 50 000 entries)
- zipshuffler.go reads zip files and then creates random permutations of the files inside the zip.
- truncate.go creates copies of all the files, truncated to 3 KB.
- main.go iterates over all files and compares our results with the results of file --mime
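
The comparison performed by main.go could look roughly like the sketch below. It is not the actual main.go. It assumes `file` is invoked with `--mime --brief` so that the output contains only the type and charset (for example `text/plain; charset=us-ascii`), and it compares media types while ignoring parameters. `runFile`, `mediaType` and `sameType` are hypothetical names; the detected type is passed in as a plain string, so the sketch does not depend on the mimetype package.

```go
package main

import (
	"fmt"
	"mime"
	"os/exec"
	"strings"
)

// runFile (hypothetical helper) returns the output of
// `file --mime --brief path`. It is shown for context and not called in main.
func runFile(path string) (string, error) {
	out, err := exec.Command("file", "--mime", "--brief", path).Output()
	return string(out), err
}

// mediaType strips parameters such as charset and surrounding whitespace.
func mediaType(s string) (string, error) {
	t, _, err := mime.ParseMediaType(strings.TrimSpace(s))
	return t, err
}

// sameType reports whether two MIME strings name the same media type,
// ignoring parameters. Unparseable input counts as a mismatch.
func sameType(ours, theirs string) bool {
	a, errA := mediaType(ours)
	b, errB := mediaType(theirs)
	return errA == nil && errB == nil && a == b
}

func main() {
	// Same media type, different charset parameter.
	fmt.Println(sameType("text/plain; charset=utf-8", "text/plain; charset=us-ascii\n")) // prints true
	// A more precise type on one side counts as a mismatch.
	fmt.Println(sameType("text/vtt", "text/plain; charset=us-ascii\n")) // prints false
}
```

Under this strict comparison, cases where one tool is simply more precise, such as text/vtt versus text/plain, are counted as misidentified.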