AllenAI C4 Dataset

https://huggingface.co/datasets/allenai/c4

How to download and prepare dataset

https://github.com/allenai/allennlp/discussions/5056

Change the current language dataset to C4: https://huggingface.co/datasets/allenai/c4
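A minimal sketch of pulling C4 through the Hugging Face `datasets` library, assuming `pip install datasets` and the `en` config listed on the dataset page. Streaming avoids downloading the full corpus up front. The helper names `stream_c4` and `take_texts` are made up for these notes and are not part of any library.

```python
from itertools import islice


def stream_c4(config="en", split="train"):
    """Open allenai/c4 in streaming mode (needs network and `datasets`)."""
    # Imported lazily so take_texts() below can be used without the library.
    from datasets import load_dataset

    return load_dataset("allenai/c4", config, split=split, streaming=True)


def take_texts(records, n):
    """Return the `text` field of the first n records of any record iterable."""
    return [record["text"] for record in islice(records, n)]


def preview(n=3, width=200):
    """Print the start of the first n C4 documents (hypothetical helper)."""
    for text in take_texts(stream_c4(), n):
        print(text[:width])
```

Calling `preview()` prints the start of the first three documents. Swapping the language dataset then comes down to feeding `stream_c4()` into whatever the existing pipeline iterates over.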

Breakdown of dataset:

[Image: breakdown of the C4 dataset (image.png)]

Hugging Face FineWeb

https://huggingface.co/datasets/HuggingFaceFW/fineweb

What is it?

The 🍷 FineWeb dataset consists of more than 15T tokens of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large-scale data processing library.

🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release of the full dataset under the ODC-By 1.0 license. However, by carefully adding additional filtering steps, we managed to push the performance of 🍷 FineWeb well above that of the original 🦅 RefinedWeb, and models trained on our dataset also outperform models trained on other commonly used high-quality web datasets (like C4, Dolma-v1.6, The Pile, SlimPajama, RedPajama2) on our aggregate group of benchmark tasks.
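To get a feel for the data without touching the full 15T tokens, a small sample can be streamed and its tokens tallied. This is only a sketch: it assumes the `sample-10BT` config and the per-document `token_count` field described on the dataset card, and `stream_fineweb` and `count_tokens` are hypothetical helper names.

```python
from itertools import islice


def stream_fineweb(name="sample-10BT", split="train"):
    """Open a FineWeb subset in streaming mode (needs network and `datasets`)."""
    # Lazy import keeps count_tokens() usable without the library installed.
    from datasets import load_dataset

    return load_dataset("HuggingFaceFW/fineweb", name=name, split=split, streaming=True)


def count_tokens(records, n):
    """Sum the assumed `token_count` field over the first n records."""
    return sum(record["token_count"] for record in islice(records, n))
```

For example, `count_tokens(stream_fineweb(), 1000)` would give the token total for the first 1000 documents of the sample.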