warc-extract

A command-line utility that extracts files of a given content type from a WARC (.warc) archive.

Building

# release build
meson setup build --buildtype=release

# debug build with sanitizers and debug printing enabled
meson setup build --buildtype=debug -Db_sanitize=address,undefined -Dc_args="$CFLAGS -DENABLE_DEBUG_PRINT"

# meson setup only configures the build directory; compile it afterwards
meson compile -C build

Usage

Usage: warc-extract FILENAME [OPTIONS [PARAMS]]

Options:
        --help                  Show help
        --verbose               Enable verbose mode

        --content-type          Specify content type to extract. Default: text/html
        --file-suffix           Specify file suffix/extension. Default: .html

Examples

# writes numbered output files:
#   0000001-my-warc-file.warc.html
#   0000002-my-warc-file.warc.html
# and so on
./warc-extract my-warc-file.warc --content-type text/html --file-suffix .html

# extract .txt files, useful for robots.txt archives
./warc-extract my-warc-file.warc --content-type text/plain --file-suffix .txt
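
The output names in the first example look like a zero-padded 7-digit counter, a hyphen, the input file name, and the suffix. A minimal sketch of that naming scheme follows, assuming the pattern holds in general. make_output_name is a hypothetical helper written for this illustration, not the function the tool uses.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper, not from the warc-extract source.
 * Formats "<7-digit index>-<input name><suffix>" into `out`.
 * Returns 0 on success, -1 if the name did not fit or formatting failed. */
static int make_output_name(char *out, size_t out_size, unsigned long index,
                            const char *input_name, const char *suffix)
{
    int n = snprintf(out, out_size, "%07lu-%s%s", index, input_name, suffix);

    /* snprintf returns the length it wanted to write; anything at or
     * beyond out_size means the result was truncated. */
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

With index 1, input name "my-warc-file.warc", and suffix ".html", this yields 0000001-my-warc-file.warc.html, matching the first file name in the example above.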

License

This program is available under the MIT License. See LICENSE.

Notes

Use at your own risk. WARC archives capture arbitrary web content, so the extracted files may be harmful or malicious.
