A command-line utility that extracts files of a specific content type from a WARC (.warc) file.
# release build
meson setup build --buildtype=release
# debug build with sanitizers and debug printing enabled
meson setup build --buildtype=debug -Db_sanitize=address,undefined -Dc_args="$CFLAGS -DENABLE_DEBUG_PRINT"
Usage: warc-extract FILENAME [OPTIONS [PARAMS]]
Options:
  --help            Show help
  --verbose         Enable verbose mode
  --content-type    Specify content type to extract. Default: text/html
  --file-suffix     Specify file suffix/extension. Default: .html
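To give a rough idea of what filtering by content type involves, here is a minimal Python sketch. It is not the tool's actual implementation. It assumes an uncompressed WARC file, looks only at `response` records, and compares the media type of the HTTP `Content-Type` header while ignoring parameters such as `charset`. All function names (`iter_warc_records`, `http_payload`, `extract`, `make_record`) are hypothetical.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, block) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return                          # end of file
        if not line.strip():
            continue                        # blank lines separating records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break                       # end of the WARC header section
            name, _, value = line.partition(b":")
            headers[name.strip().lower()] = value.strip()
        length = int(headers[b"content-length"])
        yield headers, stream.read(length)


def http_payload(block):
    """Split an HTTP response block into (media type, body)."""
    head, sep, body = block.partition(b"\r\n\r\n")
    if not sep:
        return None, b""
    media_type = None
    for line in head.split(b"\r\n")[1:]:    # skip the HTTP status line
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-type":
            # Assumption: parameters such as "; charset=utf-8" are ignored.
            raw = value.split(b";")[0].strip().lower()
            media_type = raw.decode("ascii", "replace")
    return media_type, body


def extract(stream, content_type="text/html"):
    """Return the bodies of response records whose media type matches."""
    bodies = []
    for headers, block in iter_warc_records(stream):
        if headers.get(b"warc-type") != b"response":
            continue                        # assumption: only responses matter
        media_type, body = http_payload(block)
        if media_type == content_type:
            bodies.append(body)
    return bodies


def make_record(warc_type, block):
    """Build a tiny in-memory WARC record for the demo below."""
    return (b"WARC/1.0\r\nWARC-Type: " + warc_type
            + b"\r\nContent-Length: " + str(len(block)).encode()
            + b"\r\n\r\n" + block + b"\r\n\r\n")


html = b"HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=utf-8\r\n\r\n<p>hi</p>"
text = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nUser-agent: *"
sample = make_record(b"response", html) + make_record(b"response", text)

print(extract(io.BytesIO(sample), "text/html"))   # prints [b'<p>hi</p>']
```

The sketch does not handle gzip-compressed archives or chunked transfer encoding. Whether warc-extract handles either is not stated here.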
# will extract:
# 0000001-my-warc-file.warc.html
# 0000002-my-warc-file.warc.html
# etc
./warc-extract my-warc-file.warc --content-type text/html --file-suffix .html
# extract .txt files, useful for robots.txt archives
./warc-extract my-warc-file.warc --content-type text/plain --file-suffix .txt
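The numbered file names in the examples above appear to follow a simple pattern: a zero-padded counter, a dash, the input file name, then the chosen suffix. The Python sketch below reproduces that pattern. The seven-digit padding, the counter starting at 1, and the use of the base name of the input path are all inferred from the example output, not confirmed behaviour. `output_name` is a hypothetical helper.

```python
import os


def output_name(index, warc_path, suffix=".html"):
    """Build a name like 0000001-my-warc-file.warc.html (pattern inferred)."""
    return f"{index:07d}-{os.path.basename(warc_path)}{suffix}"


print(output_name(1, "my-warc-file.warc"))          # 0000001-my-warc-file.warc.html
print(output_name(2, "my-warc-file.warc", ".txt"))  # 0000002-my-warc-file.warc.txt
```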
This program is available under the MIT License. See LICENSE.
Use at your own risk. You may end up extracting potentially harmful or malicious content because the internet is a wild place.