Web2LLM

A command-line tool to scrape web pages, GitHub repos, local folders, and PDFs into clean, aggregated Markdown suitable for Large Language Models.

Description

This tool provides a unified interface to process various sources—from live websites and code repositories to local directories and PDF files—and convert them into a structured Markdown format. The clean, token-efficient output is ideal for use as context in prompts for Large Language Models, for Retrieval-Augmented Generation (RAG) pipelines, or for documentation archiving.

Installation

For standard scraping of static websites, local files, and GitHub repositories, install the base package:

pip install web2llm

To enable JavaScript rendering for Single-Page Applications (SPAs) and other dynamic websites, you must install the [js] extra, which includes Playwright:

pip install "web2llm[js]"

After installing the js extra, you must also download the necessary browser binaries for Playwright to function:

playwright install

Usage

Command-Line Interface

The tool is run from the command line with the following structure:

web2llm <SOURCE> -o <OUTPUT_NAME> [OPTIONS]
  • <SOURCE>: The URL or local path to scrape.
  • -o, --output: The base name for the output folder and the .md and .json files created inside it.

All scraped content is saved to a new directory at output/<OUTPUT_NAME>/.
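
For example, a minimal run against a single page might look like this (the URL and output name are placeholders):

web2llm 'https://example.com/docs/getting-started' -o getting-started

This should produce output/getting-started/ containing getting-started.md and getting-started.json.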

General Options:

  • --debug: Enable debug mode for verbose, step-by-step output to stderr.

Web Scraper Options (For URLs):

  • --render-js: Render JavaScript using a headless browser. Slower but necessary for SPAs. Requires installation with the [js] extra.
  • --check-content-type: Force a network request to check the page's Content-Type header. Use for URLs that serve PDFs without a .pdf extension. Illustrative examples of both web scraper flags follow this list.
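
As a sketch, these two flags might be used as follows (the URLs and output names are placeholders, not real endpoints):

web2llm 'https://app.example.com/dashboard' -o spa-dashboard --render-js

web2llm 'https://example.com/reports/latest' -o latest-report --check-content-type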

Filesystem Options (For GitHub & Local Folders):

  • --exclude <PATTERN>: A .gitignore-style pattern for files/directories to exclude. Can be used multiple times.
  • --include <PATTERN>: A pattern to re-include a file that would otherwise be ignored by default or by an --exclude rule. Can be used multiple times.
  • --include-all: Disables all default and project-level ignore patterns. Explicit --exclude flags are still respected.

Configuration

web2llm uses a hierarchical configuration system that gives you precise control over the scraping process:

  1. Default Config: The tool comes with a built-in default_config.yaml containing a robust set of ignore patterns for common development files and selectors for web scraping.
  2. Project-Specific Config: You can create a .web2llm.yaml file in the root of your project to override or extend the default settings. This is the recommended way to manage project-specific rules; a sketch of such a file follows this list.
  3. CLI Arguments: Command-line flags provide the final layer of control, overriding any settings from the configuration files for a single run.
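
As a sketch, a project-level .web2llm.yaml might add a few ignore rules on top of the built-in defaults. The exact keys are defined by the bundled default_config.yaml, so the names below are illustrative assumptions rather than confirmed settings:

# .web2llm.yaml - illustrative sketch; key names are assumptions, check default_config.yaml for the real ones
exclude_patterns:
  - 'tests/'
  - '*.log'
include_patterns:
  - '!LICENSE'

The patterns follow the same .gitignore-style syntax as the --exclude and --include flags.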

Examples

1. Scrape a specific directory within a GitHub repo:

web2llm 'https://github.com/tiangolo/fastapi' -o fastapi-src --include 'fastapi/'

2. Scrape a local project, excluding test and documentation folders:

web2llm '~/dev/my-project' -o my-project-code --exclude 'tests/' --exclude 'docs/'

3. Scrape a local project but re-include the LICENSE file, which is ignored by default:

web2llm '.' -o my-project-with-license --include '!LICENSE'

4. Scrape everything in a project except the .git directory:

web2llm . -o my-project-full --include-all --exclude '.git/'

5. Scrape just the "Installation" section from a webpage:

web2llm 'https://fastapi.tiangolo.com/#installation' -o fastapi-install

6. Scrape a PDF from an arXiv URL:

web2llm 'https://arxiv.org/pdf/1706.03762.pdf' -o attention-is-all-you-need

Contributing

Contributions are welcome. Please refer to the project's issue tracker and CONTRIBUTING.md file for information on how to participate.

License

This project is licensed under the MIT License. See the LICENSE file for details.
