commoncrawl

Field	Value
Description	Common Crawl January 2026 crawl, WET files (CC-MAIN-2026-04). The January 2026 crawl archive contains 2.30 billion pages; see the announcement for details.
Folder	`/datasets/ai/commoncrawl`
Discipline	AI / Large Language Models / NLP
DOI
Link	Access Data
Public	`True`
Publication Date	2026-04
Downloaded	2026-03-07
Data Type	LMDB, SquashFS, Extracted WET files on Ceph
Dataset Size	6.5TB (extracted)
Number of Files	100,000 (extracted)
Usage	`$ module avail` `$ module load datasets` `$ module load ai/commoncrawl/2026-04`
Usage Policy Link	https://commoncrawl.org/terms-of-use
Usage Policy	Users of the Common Crawl dataset must comply with all applicable laws and respect the intellectual property and privacy rights of original content owners, as the data is derived from publicly crawled web sources and may include copyrighted or sensitive material. The dataset is provided “as is” without guarantees of accuracy, legality, or completeness, and users are fully responsible for evaluating and filtering the data before use, especially in AI/ML applications. Any redistribution must preserve these responsibilities, must not claim ownership of the original data, and should clearly inform downstream users of applicable restrictions. By using this dataset, users assume all risks and liabilities associated with its use, including any legal or ethical implications.
Citation	Common Crawl. (2026). Common Crawl corpus (CC-MAIN-2026-04) [Data set]. https://commoncrawl.org/
BibTeX	@dataset{commoncrawl_2026_04, author = {Common Crawl}, title = {Common Crawl corpus (CC-MAIN-2026-04)}, year = {2026}, howpublished = {\url{https://commoncrawl.org/}}, note = {Data set} }
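
The extracted WET files listed under Data Type follow the WARC record layout: each record is a block of `Name: value` headers, a blank line, and then `Content-Length` bytes of plain text. The sketch below is one minimal way to iterate over such records using only the Python standard library. The function name `iter_wet_records` and the in-memory sample are illustrative assumptions, not part of the dataset or the module. The file layout under `/datasets/ai/commoncrawl` is not described here, so no real path is shown.

```python
import io
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_wet_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], str]]:
    """Yield (headers, text) for each record in an uncompressed WET stream.

    Hypothetical helper for illustration; assumes well-formed WARC records.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # skip the blank lines that separate records
        headers: Dict[str, str] = {}
        while True:
            raw = stream.readline()
            if not raw.strip():
                break  # an empty line (or end of stream) closes the header block
            name, _, value = raw.decode("utf-8", errors="replace").partition(":")
            headers[name.strip()] = value.strip()
        # Read exactly the number of payload bytes the record declares.
        length = int(headers.get("Content-Length", "0"))
        body = stream.read(length)
        yield headers, body.decode("utf-8", errors="replace")


# Small in-memory sample standing in for a real WET file (made-up content).
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: conversion\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-Length: 11\r\n"
    b"\r\n"
    b"Hello world"
    b"\r\n\r\n"
)

records = list(iter_wet_records(io.BytesIO(sample)))
for headers, text in records:
    print(headers.get("WARC-Target-URI"), len(text))
```

To read a file from disk, pass a handle opened in binary mode, for example `open(path, "rb")`. If a copy is still gzip-compressed, as the upstream `.warc.wet.gz` files are, `gzip.open(path, "rb")` can be passed instead. Whether the extracted copies on Ceph are compressed is not stated here, so check before choosing.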