alexmi1/warc-extract

warc-extract

A command-line utility that extracts files of a given content type from a WARC (.warc) file.
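This README does not describe how records are matched against the requested content type, or whether the match uses the WARC record header or the archived HTTP response header. As a rough illustration only, the sketch below shows one way a header line could be compared with a value such as `text/html`. The function name `header_has_content_type` is hypothetical and is not taken from the warc-extract source.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, NOT taken from warc-extract's source.
 * Returns true when a header line such as
 * "Content-Type: text/html; charset=utf-8" names the wanted media type. */
bool header_has_content_type(const char *line, const char *wanted)
{
    static const char name[] = "content-type:";
    size_t i;

    /* Header field names are case-insensitive. */
    for (i = 0; name[i] != '\0'; i++) {
        if (tolower((unsigned char)line[i]) != name[i])
            return false;
    }
    line += i;

    /* Skip optional whitespace after the colon. */
    while (*line == ' ' || *line == '\t')
        line++;

    /* Compare the media type itself, also case-insensitively. */
    for (i = 0; wanted[i] != '\0'; i++) {
        if (tolower((unsigned char)line[i]) !=
            tolower((unsigned char)wanted[i]))
            return false;
    }

    /* The value must end here or continue with parameters or whitespace,
     * so that "text/html" does not match "text/html5". */
    char next = line[i];
    return next == '\0' || next == ';' || next == ' ' || next == '\t' ||
           next == '\r' || next == '\n';
}
```

Matching on the media type alone, and ignoring parameters such as `charset`, is a design choice of this sketch. The actual tool may behave differently.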

Building

# release build
meson setup build --buildtype=release

# or: debug build with sanitizers and debug printing enabled
meson setup build --buildtype=debug -Db_sanitize=address,undefined -Dc_args="$CFLAGS -DENABLE_DEBUG_PRINT"

# compile the configured build directory
meson compile -C build

Usage

Usage: warc-extract FILENAME [OPTIONS [PARAMS]]

Options:
        --help                  Show help
        --verbose               Enable verbose mode

        --content-type          Specify content type to extract. Default: text/html
        --file-suffix           Specify file suffix/extension. Default: .html

Examples

# extracts each matching record to a numbered file:
#   0000001-my-warc-file.warc.html
#   0000002-my-warc-file.warc.html
# and so on
./warc-extract my-warc-file.warc --content-type text/html --file-suffix .html

# extract .txt files, useful for robots.txt archives
./warc-extract my-warc-file.warc --content-type text/plain --file-suffix .txt
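The output names in the first example appear to follow the pattern of a seven-digit zero-padded counter, a dash, the input file name, and the suffix. The sketch below reproduces that pattern. It is inferred from the example names only; `make_output_name` is a hypothetical helper, not the tool's actual code.

```c
#include <stdio.h>

/* Hypothetical helper, NOT taken from warc-extract's source.
 * Builds "<7-digit counter>-<input name><suffix>", the pattern inferred
 * from the example output names in this README.
 * Returns 0 on success, -1 if the name does not fit in the buffer. */
int make_output_name(char *out, size_t out_size, unsigned counter,
                     const char *input_name, const char *suffix)
{
    int n = snprintf(out, out_size, "%07u-%s%s", counter, input_name, suffix);

    /* snprintf returns the length it wanted to write; a value that is
     * negative or not smaller than out_size means error or truncation. */
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

With a counter of 1, an input name of `my-warc-file.warc`, and a suffix of `.html`, this produces `0000001-my-warc-file.warc.html`, which matches the first name listed above.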

License

This program is available under the MIT License. See LICENSE.

Notes

Use at your own risk. WARC files archive arbitrary web content, so the extracted files may be harmful or malicious.
