The Synthetic Data Playbook: Generating Trillions of the Finest Tokens
Visualize synthetic data experiments as an interactive bookshelf
We release large pre-training datasets to accelerate open LLM development. Part of the Hugging Face Science team (hf.co/science)
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Viewer to explore the finewiki dataset
Generate a curated web-text dataset for LLM training
Evaluate multilingual models using FineTasks
Explore and analyze experiment results
Launch an interactive demo interface