commoncrawl

Field Value
Description Common Crawl January 2026 Crawl WET files (CC-MAIN-2026-04). The January 2026 crawl archive contains 2.30 billion pages; see the announcement for details.
Folder /datasets/ai/commoncrawl
Discipline AI / Large Language Models / NLP
DOI
Link Access Data
Public True
Publication Date 2026-04
Downloaded 2026-03-07
Data Type LMDB, SquashFS, Extracted WET files on Ceph
Dataset Size 6.5 TB (extracted)
Number of Files 100,000 (extracted)
Usage
$ module avail
$ module load datasets
$ module load ai/commoncrawl/2026-04
Usage Policy Link https://commoncrawl.org/terms-of-use
Usage Policy Users of the Common Crawl dataset must comply with all applicable laws and respect the intellectual property and privacy rights of original content owners, as the data is derived from publicly crawled web sources and may include copyrighted or sensitive material. The dataset is provided “as is” without guarantees of accuracy, legality, or completeness, and users are fully responsible for evaluating and filtering the data before use, especially in AI/ML applications. Any redistribution must preserve these responsibilities, must not claim ownership of the original data, and should clearly inform downstream users of applicable restrictions. By using this dataset, users assume all risks and liabilities associated with its use, including any legal or ethical implications.
Citation Common Crawl. (2026). Common Crawl corpus (CC-MAIN-2026-04) [Data set]. https://commoncrawl.org/
BibTeX
@dataset{commoncrawl_2026_04,
  author       = {Common Crawl},
  title        = {Common Crawl corpus (CC-MAIN-2026-04)},
  year         = {2026},
  howpublished = {\url{https://commoncrawl.org/}},
  note         = {Data set}
}
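
Reading WET records (illustrative sketch)

The entry above lists extracted WET files as one of the available data types but does not show how to read them. WET files use the WARC record layout: a `WARC/1.0` version line, a block of `Name: value` headers, a blank line, then `Content-Length` bytes of extracted plain text. The following is a minimal sketch of a reader for that layout using only the Python standard library. The function names `iter_wet_records` and `open_wet` are hypothetical and not part of any cluster tooling, and the sketch assumes nothing about where the module places the files. Common Crawl distributes WET files gzip-compressed; the copies on the cluster may or may not be compressed, so the parser works on any binary stream.

```python
import gzip
import io
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_wet_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], str]]:
    """Yield (headers, text) for each WARC-format record in a binary stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected record start: {line[:40]!r}")

        # Header block: "Name: value" lines terminated by a blank line.
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if not line.strip():
                break
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()

        # Payload: exactly Content-Length bytes of extracted text.
        length = int(headers.get("Content-Length", "0"))
        payload = stream.read(length)
        yield headers, payload.decode("utf-8", "replace")


def open_wet(path: str) -> BinaryIO:
    """Open a WET file, assuming a .gz suffix means gzip compression."""
    if path.endswith(".gz"):
        return gzip.open(path, "rb")
    return open(path, "rb")


if __name__ == "__main__":
    # Tiny in-memory record so the sketch runs without access to the dataset.
    sample = (
        b"WARC/1.0\r\n"
        b"WARC-Type: conversion\r\n"
        b"WARC-Target-URI: http://example.com/\r\n"
        b"Content-Length: 11\r\n"
        b"\r\n"
        b"hello world"
        b"\r\n\r\n"
    )
    for headers, text in iter_wet_records(io.BytesIO(sample)):
        print(headers["WARC-Target-URI"], text)
```

In a real WET file the first record is typically a `warcinfo` record describing the file itself; filtering on `headers.get("WARC-Type") == "conversion"` keeps only the page text records. For large-scale processing a dedicated WARC library is likely a better choice than this hand-rolled reader.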