🤖 AI Summary
To address the core challenges in large language model pretraining—namely, the difficulty of organizing high-quality web data, its high acquisition cost, and the absence of a standardized annotation schema—this paper introduces Essential-Web v1.0, a structured web dataset comprising 24 trillion tokens. The authors propose a fine-grained, queryable taxonomy framework spanning twelve categories—including topic, format, content complexity, and quality—to enable precise data curation. They also develop EAI-Distill-0.5b, a lightweight distilled annotation model whose labeling agreement falls within 3% of the much larger Qwen2.5-32B-Instruct. Leveraging SQL-style filtering over these multidimensional annotations, with the dataset hosted on Hugging Face, their approach yields domain-specific subsets that surpass state-of-the-art (SOTA) web-curated alternatives: +24.5% on STEM, +8.6% on medical, and +14.3% on web code benchmarks—albeit with a modest −8.0% drop on mathematical tasks. These results empirically validate the critical role of high-quality, structured data in enhancing domain-specific model performance.
📝 Abstract
Data plays the most prominent role in how language models acquire skills and knowledge. The lack of massive, well-organized pre-training datasets results in costly and inaccessible data pipelines. We present Essential-Web v1.0, a 24-trillion-token dataset in which every document is annotated with a twelve-category taxonomy covering topic, format, content complexity, and quality. Taxonomy labels are produced by EAI-Distill-0.5b, a fine-tuned 0.5B-parameter model that achieves annotator agreement within 3% of Qwen2.5-32B-Instruct. With nothing more than SQL-style filters, we obtain competitive web-curated datasets in math (−8.0% relative to SOTA), web code (+14.3%), STEM (+24.5%), and medical (+8.6%). Essential-Web v1.0 is available on Hugging Face: https://huggingface.co/datasets/EssentialAI/essential-web-v1.0
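To make the "SQL-style filters" idea concrete, here is a minimal sketch using an in-memory SQLite table as a stand-in for the annotated dataset. The column names (`topic`, `doc_format`, `complexity`, `quality`) and the sample values are hypothetical illustrations of the twelve-category taxonomy, not Essential-Web v1.0's actual schema; the point is only that carving out a domain subset becomes a declarative query rather than a bespoke pipeline.

```python
import sqlite3

# Toy table mimicking per-document taxonomy annotations.
# Column names and values are illustrative assumptions, not the real schema.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE documents (
        text       TEXT,
        topic      TEXT,  -- e.g. 'stem', 'medical', 'web_code'
        doc_format TEXT,  -- e.g. 'tutorial', 'forum', 'reference'
        complexity TEXT,  -- e.g. 'introductory', 'advanced'
        quality    REAL   -- higher = better
    )
""")
rows = [
    ("Intro to eigenvalues ...",   "stem",          "tutorial",  "introductory", 0.92),
    ("Celebrity gossip ...",       "entertainment", "news",      "introductory", 0.40),
    ("Advanced CUDA kernels ...",  "web_code",      "tutorial",  "advanced",     0.88),
    ("Drug interaction notes ...", "medical",       "reference", "advanced",     0.75),
]
conn.executemany("INSERT INTO documents VALUES (?, ?, ?, ?, ?)", rows)

# A high-quality STEM subset is just a filter over the annotations.
stem_docs = conn.execute(
    "SELECT text FROM documents WHERE topic = 'stem' AND quality >= 0.8"
).fetchall()
print(stem_docs)
```

Swapping the `WHERE` clause (e.g. `topic = 'medical'`, or adding `complexity = 'advanced'`) yields a different curated subset without touching any upstream processing.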