| Description |
Common Crawl January 2026 Crawl WET files (CC-MAIN-2026-04). The January 2026 crawl archive contains 2.30 billion pages; see the announcement for details. |
| Folder |
/datasets/ai/commoncrawl |
| Discipline |
AI / Large Language Models / NLP |
| DOI |
|
| Link |
Access Data |
| Public |
True |
| Publication Date |
2026-04 |
| Downloaded |
2026-03-07 |
| Data Type |
LMDB, SquashFS, Extracted WET files on Ceph |
| Dataset Size |
6.5TB (extracted) |
| Number of Files |
100,000 (extracted) |
| Usage |
$ module avail
$ module load datasets
$ module load ai/commoncrawl/2026-04 |
| Usage Policy Link |
https://commoncrawl.org/terms-of-use |
| Usage Policy |
Users of the Common Crawl dataset must comply with all applicable laws and respect the intellectual property and privacy rights of original content owners, as the data is derived from publicly crawled web sources and may include copyrighted or sensitive material. The dataset is provided "as is" without guarantees of accuracy, legality, or completeness, and users are fully responsible for evaluating and filtering the data before use, especially in AI/ML applications. Any redistribution must preserve these responsibilities, must not claim ownership of the original data, and should clearly inform downstream users of applicable restrictions. By using this dataset, users assume all risks and liabilities associated with its use, including any legal or ethical implications. |
| Citation |
Common Crawl. (2026). Common Crawl corpus (CC-MAIN-2026-04) [Data set]. https://commoncrawl.org/ |
| BibTeX |
@dataset{commoncrawl_2026_04,
  author = {Common Crawl},
  title = {Common Crawl corpus (CC-MAIN-2026-04)},
  year = {2026},
  howpublished = {\url{https://commoncrawl.org/}},
  note = {Data set}
} |
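
Because the dataset is listed as extracted WET files, a short reader may help when getting started. The following is a minimal sketch, not an official loader: it assumes each file follows the standard WARC record layout (a `WARC/1.0` version line, header lines, a blank line, then `Content-Length` bytes of plain text). The helper names `open_wet` and `iter_wet_records` are hypothetical, and whether the extracted files are still gzip-compressed is an assumption to check against the actual contents of the folder listed above.

```python
import gzip
import io
from typing import BinaryIO, Dict, Iterator, Tuple


def open_wet(path: str) -> BinaryIO:
    """Open a WET file, transparently handling gzip (assumed from the .gz suffix)."""
    if path.endswith(".gz"):
        return gzip.open(path, "rb")
    return open(path, "rb")


def iter_wet_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], str]]:
    """Yield (headers, text) for each WARC record found in a binary stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            continue  # skip the blank separator lines between records
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if not line or line in (b"\r\n", b"\n"):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # Read the payload by its declared length rather than by scanning lines.
        length = int(headers.get("Content-Length", "0"))
        text = stream.read(length).decode("utf-8", "replace")
        yield headers, text


# Synthetic single-record example so the sketch can be run without the dataset.
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: conversion\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: 11\r\n"
    b"\r\n"
    b"hello world"
    b"\r\n\r\n"
)
records = list(iter_wet_records(io.BytesIO(sample)))
```

With a real file, `iter_wet_records(open_wet(path))` would be used instead of the in-memory sample, where `path` points at one of the extracted files. Records whose `WARC-Type` is `conversion` carry the extracted page text; the first record in a WET file is typically a `warcinfo` record describing the file itself.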