Raw Common Crawl crawls, annotated with potential Creative Commons license information
- TODO: add context (x chars before/after the license)
The licensing information is extracted from web pages based on whether they link to Creative Commons licenses, so false positives may occur! Further filtering on the location type of the license (e.g. removing hyperlink (`a_tag`) references) should improve precision, but false positives can still slip through. Note that no quality filtering is applied, to ensure wide coverage! Similarly, no deduplication was performed, both for computational reasons and because the data is quite scarce; you may want to deduplicate yourself with less strict constraints. As an indicator of quality, however, a column is added that indicates whether a sample exists in the FineWeb(-2) dataset.
The approach to creating this dataset differs from similar endeavors such as the awesome common-pile/dolma-cccc and C4Corpus datasets. They rely on intricately crafted regular expressions to quickly extract potential licenses from a web page (string-based matching). However, that makes it hard to retrieve any structural meta information about the license, such as where it was found on the page. In C5, the whole web page is parsed into a programmatic structure, allowing for an iterative search through this parsed "tree". That makes it possible to track where licenses were found (in the `head` of a document, for instance). Such information is crucial to minimise false positives: if a license is referenced in a `meta` tag in the `head` of an HTML page, it is more trustworthy than a "random link" to a copyright license in the middle of a web page, which might just be discussing the license in general or providing a license for a single picture on the site. This metadata makes it possible to attach a confidence to each extracted license, enabling robust filtering against false positives. While I strongly believe this approach is valuable, it is also much slower than a regex search!
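To give an idea of what this tree-based search looks like, here is a minimal sketch using BeautifulSoup; the repository's actual parser and heuristics differ (e.g. json-ld blocks, footer-like class/ID detection and the publicdomain/zero paths are omitted here), so treat this purely as an illustration.

```python
# Illustrative sketch only: find CC license links in a parsed HTML tree and
# record *where* they occur. Not the repository's actual implementation.
from bs4 import BeautifulSoup

CC_HINT = "creativecommons.org/licenses/"  # note: /publicdomain/ paths omitted


def find_cc_licenses(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "html.parser")
    found = []

    # <meta> tags (usually in the <head>): the most trustworthy signal.
    for meta in soup.find_all("meta"):
        content = meta.get("content") or ""
        if CC_HINT in content:
            found.append({"location": "meta_tag", "url": content,
                          "in_head": meta.find_parent("head") is not None,
                          "in_footer": False})

    # <link> tags, e.g. <link rel="license" href="...">.
    for link in soup.find_all("link"):
        href = link.get("href") or ""
        if CC_HINT in href:
            found.append({"location": "link_tag", "url": href,
                          "in_head": link.find_parent("head") is not None,
                          "in_footer": False})

    # Plain <a> hyperlinks: the weakest signal, most prone to false positives.
    for a in soup.find_all("a", href=True):
        if CC_HINT in a["href"]:
            found.append({"location": "a_tag", "url": a["href"],
                          "in_head": False,
                          "in_footer": a.find_parent("footer") is not None})

    return found
```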
By default, we only process the following languages. You can change this by adding a `languages` key to your YAML config file with the languages that you want. Default:
languages:
- afr_Latn
- deu_Latn
- eng_Latn
- fra_Latn
- fry_Latn
- ita_Latn
- nld_Latn
- spa_Latn
The following fields are extracted:
In some cases, multiple licenses are found on a single page. All of them are collected in `potential_licenses`. From these, the "best guess" is selected based on three criteria: all licenses are sorted on these properties and the first one is selected. E.g., licenses in a `meta` tag are preferred over those in an `a` tag, and (location being equal) a license in the `head` or footer has preference over one in the body.
- location_preference_order: meta_tag, json-ld, link_tag, a_tag
- head_preference_order: True, False
- footer_preference_order: True, False
Based on these criteria, the "best guess" license is picked and stored in the `license_*` columns. Potential disagreement between multiple licenses is flagged in `license_disagreement`.
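To make that selection concrete, here is a minimal sketch (not the repository's actual code) that sorts candidates by the three preference orders and takes the first:

```python
# Illustration of the "best guess" selection: sort all candidate licenses by
# location, then head membership, then footer membership, and take the first.
LOCATION_PREFERENCE = ["meta_tag", "json-ld", "link_tag", "a_tag"]


def pick_best_license(potential_licenses: dict) -> dict:
    candidates = [
        {"abbr": a, "version": v, "location": loc, "in_head": h, "in_footer": f}
        for a, v, loc, h, f in zip(
            potential_licenses["abbr"],
            potential_licenses["version"],
            potential_licenses["location"],
            potential_licenses["in_head"],
            potential_licenses["in_footer"],
        )
    ]
    # Lower sort key = higher preference; `not` makes True sort before False.
    candidates.sort(
        key=lambda c: (LOCATION_PREFERENCE.index(c["location"]),
                       not c["in_head"], not c["in_footer"])
    )
    return candidates[0]


example = {
    "abbr": ["by-nc", "by"],
    "version": ["4.0", "4.0"],
    "location": ["a_tag", "meta_tag"],
    "in_head": [False, True],
    "in_footer": [True, False],
}
print(pick_best_license(example))  # -> the meta_tag/head license ("by")
```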
- text: the extracted text (unmodified)
- id: WARC-Record-ID
- dump: Common Crawl crawl
- url: original url for document
- date: crawl date
- file_path: file path on the S3 bucket
- license_abbr: the license type. Possible values: "cc-unknown" (recommended to filter this one out), "by", "by-sa", "by-nd", "by-nc", "by-nc-sa", "by-nc-nd", "zero", "certification", "mark". If multiple licenses were found, `potential_licenses` contains all of them.
- license_version: the license version, e.g. "4.0"
- license_location: the location where the license was found. Possible values: "meta_tag", "json-ld", "link_tag", "a_tag"
- license_in_head: whether the license was found inside a `head` HTML element
- license_in_footer: whether the license was found inside a `footer` HTML element, or an HTML element that had `footer` in the ID or class name
- potential_licenses:
- abbr: list of all found license abbreviations
- version: list of all found license versions
- location: list of all found license locations
- in_head: list of whether licenses were found in the head
- in_footer: list of whether licenses were found in a footer
- license_parse_error: whether there was a problem when trying to extract the license, e.g. an unparseable HTML document
- license_disagreement: whether the `potential_licenses["abbr"]` entries disagree, i.e., different types of licenses were found. License versions are not included in the comparison!
- language_script: the script of the language, as detected by glotlid
- language: the language, as detected by glotlid
- language_score: the language identification confidence score
- found_in_fw: whether this sample was found in FineWeb(-2). For non-English languages, crawls that are more recent than FineWeb-2 (everything after 2024-18) are marked as None. For English, crawls that are more recent than FineWeb v1.3 (after 2024-51) are marked as None.
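As a practical example of using these columns downstream, the snippet below loads one configuration and applies some of the recommended filters (dropping "cc-unknown", disagreeing licenses, parse errors, and weak `a_tag`-only locations). The configuration name used here is an assumption; check the dataset page for the actual configuration and split names.

```python
# Sketch of downstream filtering. The config name "CC-MAIN-2024-51-eng_Latn"
# is an assumption -- check the dataset page for the real configuration names.
from datasets import load_dataset

ds = load_dataset(
    "BramVanroy/CommonCrawl-CreativeCommons",
    "CC-MAIN-2024-51-eng_Latn",
    split="train",
)

ds = ds.filter(
    lambda row: row["license_abbr"] != "cc-unknown"   # unclear license type
    and not row["license_disagreement"]               # conflicting licenses on the page
    and not row["license_parse_error"]                # HTML could not be parsed cleanly
    and row["license_location"] != "a_tag"            # weakest location signal
)
print(ds)
```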
Simply pip install this repository. E.g., for an editable install:
python -m pip install -e .
While local alternatives are provided for running the pipeline on your own machine (mostly for debugging), the recommended use is via SLURM through `scripts/run_slurm.py`. Usage is facilitated via the SLURM launch scripts in `slurm/launch.slurm`. To use the scripts, you need to take care of a few things:
- The pipeline includes a check to see whether a sample exists in the FineWeb(-2) dataset, as a quality signal. Download the DuckDB files of the languages that you are interested in. By default we process the languages mentioned above, so download those to the expected `duckdbs/fineweb-2` directory inside this project root (a rough sketch of what such a lookup boils down to is given after this list). Update: this step is now automated, so you do not have to do it manually anymore, but you can still do so if that works better for your workflow.
huggingface-cli download BramVanroy/fineweb-2-duckdbs fw2-afr_Latn.duckdb --local-dir duckdbs/fineweb-2/ --repo-type dataset
huggingface-cli download BramVanroy/fineweb-2-duckdbs fw2-deu_Latn.duckdb --local-dir duckdbs/fineweb-2/ --repo-type dataset
huggingface-cli download BramVanroy/fineweb-2-duckdbs fw2-fra_Latn.duckdb --local-dir duckdbs/fineweb-2/ --repo-type dataset
huggingface-cli download BramVanroy/fineweb-2-duckdbs fw2-fry_Latn.duckdb --local-dir duckdbs/fineweb-2/ --repo-type dataset
huggingface-cli download BramVanroy/fineweb-2-duckdbs fw2-ita_Latn.duckdb --local-dir duckdbs/fineweb-2/ --repo-type dataset
huggingface-cli download BramVanroy/fineweb-2-duckdbs fw2-nld_Latn.duckdb --local-dir duckdbs/fineweb-2/ --repo-type dataset
huggingface-cli download BramVanroy/fineweb-2-duckdbs fw2-spa_Latn.duckdb --local-dir duckdbs/fineweb-2/ --repo-type dataset
For English, we use FineWeb DuckDBs. These are structured differently - one DuckDB per crawl name (e.g. 2024-51).
huggingface-cli download BramVanroy/fineweb-duckdbs fw-CC-MAIN-2024-51.duckdb --local-dir duckdbs/fineweb/ --repo-type dataset
- In the SLURM scripts under `slurm/`, change the constants/variables in capital letters to your specific use case (account, partition, etc.).
- Update the config under `configs/` depending on your hardware. This may take some trial and error on your specific system, but the default values are expected to work. In `configs/config-slurm.yaml`, make sure to update the root dir where you saved the DuckDB files.
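For the curious, a FineWeb(-2) membership lookup against one of these DuckDB files could look roughly like the sketch below. The table name "data" and column "url" are assumptions; inspect the files (e.g. with `SHOW TABLES`) for the real schema.

```python
# Rough sketch of checking whether a URL occurs in a FineWeb(-2) DuckDB file.
# The table name "data" and the column "url" are assumptions.
import duckdb

con = duckdb.connect("duckdbs/fineweb-2/fw2-nld_Latn.duckdb", read_only=True)
print(con.execute("SHOW TABLES").fetchall())  # discover the actual table name


def found_in_fw(url: str, table: str = "data") -> bool:
    row = con.execute(
        f"SELECT 1 FROM {table} WHERE url = ? LIMIT 1", [url]
    ).fetchone()
    return row is not None
```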
Now you can submit the job to start processing a specific crawl, e.g.
sbatch launch.slurm CC-MAIN-2024-51
Output of the first step will be saved, by default, in `output-main/`, and the final data (with the added column indicating whether the sample exists in FineWeb(-2)) in `output/`.
Hugging Face publicly keeps track of cease-and-desist notices they have received, including for domains that were removed from FineWeb(-2). I collect those domains in BramVanroy/finewebs-copyright-domains, which in turn allows us to filter these offending domains out of our dataset. From v1.3.0 of this library onwards, this is done automatically. For data that was processed with an earlier version, a utility script is provided that filters out offending domains on a per-parquet-file basis.
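The idea behind that per-file filtering can be illustrated as follows. This is a simplified sketch, not the actual utility script; the blocklist is inlined and the file path is hypothetical, and in practice the blocklist would come from BramVanroy/finewebs-copyright-domains.

```python
# Simplified sketch: drop rows whose domain is on a blocklist, one parquet
# file at a time. Not the actual utility script shipped with this repo.
from urllib.parse import urlparse

import pandas as pd

blocked_domains = {"example-offending-domain.com"}  # placeholder blocklist

path = "output/CC-MAIN-2024-51/000_00000.parquet"  # hypothetical file path
df = pd.read_parquet(path)
domains = df["url"].map(lambda u: urlparse(u).netloc.lower())
df = df[~domains.isin(blocked_domains)]
df.to_parquet(path.replace("output/", "output-filtered/"), index=False)
```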
In addition to the CC-BY dataset, you may wish to figure out which domains are NOT in it, so you can contact domain owners individually to strike an agreement about using their data. With the script `scripts/analysis/find_top_domains.py`, we first get all domains found in a specific language's FineWeb-2 dataset, including the `_removed` portion. Then all domains that are present in the CC-BY dataset are removed from that initial set. Obviously this approach has its issues: the CC-BY dataset does not cover the same crawls as FineWeb-2, so the potential coverage differs. Still, the results provide some insight into popular domains that were not (yet) found in the CC-BY data.
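Conceptually this boils down to a set difference over domains. A stripped-down illustration of that idea (not the actual script; the URL lists are placeholders) could look like this:

```python
# Stripped-down illustration of the domain set difference behind
# scripts/analysis/find_top_domains.py (not the actual script).
from collections import Counter
from urllib.parse import urlparse

# Placeholder inputs: in practice these would be the URLs of a FineWeb-2
# language (including its _removed portion) and of the CC-BY dataset.
fineweb2_urls = ["https://news.example.org/a", "https://blog.example.com/b",
                 "https://news.example.org/c"]
ccby_urls = ["https://blog.example.com/x"]


def domain_counts(urls) -> Counter:
    return Counter(urlparse(u).netloc.lower() for u in urls)


fw2_domains = domain_counts(fineweb2_urls)
ccby_domains = set(domain_counts(ccby_urls))

# Popular FineWeb-2 domains that do not appear in the CC-BY data at all.
missing = {d: c for d, c in fw2_domains.items() if d not in ccby_domains}
print(sorted(missing.items(), key=lambda kv: kv[1], reverse=True)[:20])
```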
See the dataset page.
In the current absence of a publication, please cite the dataset as follows. Including a footnote URL to this page is also appreciated!
@misc{vanroy2025C5,
author = { Bram Vanroy },
title = { CommonCrawl CreativeCommons Corpus (C5) },
year = 2025,
url = { https://huggingface.co/datasets/BramVanroy/CommonCrawl-CreativeCommons },
doi = { 10.57967/hf/5340 },
publisher = { Hugging Face }
}
If you use or modify the software, please cite:
@software{Vanroy_CommonCrawl-CreativeCommons_2025,
author = {Vanroy, Bram},
license = {GPL-3.0},
month = feb,
title = {{CommonCrawl-CreativeCommons}},
url = {https://github.com/BramVanroy/CommonCrawl-CreativeCommons},
version = {1.3.0},
year = {2025}
}
- The Common Crawl non-profit organization.
- TNO, who funded the work hours to accomplish this code. They intend to use (parts of) the generated material for the GPT-NL project.
- Flemish Supercomputer Center for part of the compute under grant 2024-107
- Guilherme Penedo (@guipenedo) and the rest of the FineWeb and datatrove team for the help and insights
- ML6 and specifically Robin Van Craenenbroek for their Fondant Creative Commons filter for image datasets. While my approach is different, their code did serve as inspiration.