Essential-Web v1.0: 24T tokens of organized web data

📅 2025-06-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address core challenges in large language model pretraining, namely the difficulty of organizing high-quality web data, its high acquisition cost, and the absence of a standardized annotation schema, this paper introduces Essential-Web v1.0, a structured web dataset comprising 24 trillion tokens. The authors propose a fine-grained, queryable taxonomy spanning twelve categories, including topic, format, content complexity, and quality, to enable precise data curation. They also develop EAI-Distill-0.5b, a lightweight 0.5b-parameter annotation model whose labeling agreement falls within 3% of Qwen2.5-32B-Instruct. Combining these multidimensional semantic annotations with SQL-style filtering over the Hugging Face-hosted corpus yields domain-specific subsets that are competitive with state-of-the-art (SOTA) alternatives: +24.5% on STEM, +8.6% on medical, and +14.3% on web-code benchmarks, with a modest −8.0% gap on mathematical tasks. These results underscore the role of high-quality, structured data in improving domain-specific model performance.

📝 Abstract
Data plays the most prominent role in how language models acquire skills and knowledge. The lack of massive, well-organized pre-training datasets results in costly and inaccessible data pipelines. We present Essential-Web v1.0, a 24-trillion-token dataset in which every document is annotated with a twelve-category taxonomy covering topic, format, content complexity, and quality. Taxonomy labels are produced by EAI-Distill-0.5b, a fine-tuned 0.5b-parameter model that achieves annotator agreement within 3% of Qwen2.5-32B-Instruct. With nothing more than SQL-style filters, we obtain competitive web-curated datasets in math (-8.0% relative to SOTA), web code (+14.3%), STEM (+24.5%) and medical (+8.6%). Essential-Web v1.0 is available on HuggingFace: https://huggingface.co/datasets/EssentialAI/essential-web-v1.0
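
To make the curation workflow concrete, here is a minimal sketch of pulling a taxonomy-filtered slice of Essential-Web v1.0 with the Hugging Face `datasets` library. The field names used in the filter (`taxonomy`, `topic`, `quality`) and the `text` column are illustrative assumptions, not the published schema; check the dataset card for the actual layout.

```python
from datasets import load_dataset

# Stream the corpus: at 24 trillion tokens, a full download is impractical.
ds = load_dataset(
    "EssentialAI/essential-web-v1.0",
    split="train",
    streaming=True,
)

def is_stem_doc(example):
    # `taxonomy`, `topic`, and `quality` are hypothetical field names
    # standing in for the paper's twelve-category annotations.
    labels = example.get("taxonomy") or {}
    return labels.get("topic") == "STEM" and labels.get("quality") == "high"

stem_subset = ds.filter(is_stem_doc)

# Peek at the first few matching documents.
for doc in stem_subset.take(3):
    print(doc["text"][:200])
```

Streaming plus a per-example predicate mirrors the paper's pitch: curation becomes a filter over precomputed labels rather than a bespoke scraping-and-classification pipeline.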
Problem

Research questions and friction points this paper is trying to address.

Lack of massive, well-organized pre-training datasets
Costly and inaccessible data pipelines for models
Need for annotated datasets to improve model skills
Innovation

Methods, ideas, or system contributions that make the work stand out.

24T-token web dataset with taxonomy
EAI-Distill-0.5b model for annotation
SQL-style filters enable competitive data subsets (sketched below)
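
The "SQL filters" claim can be taken literally: since the corpus is published as Parquet on Hugging Face, an embedded engine such as DuckDB can carve out a domain subset with one query. This is a sketch under assumptions, not the authors' pipeline; the `hf://` glob and the `topic`/`quality` columns are placeholders for the real schema.

```python
import duckdb

con = duckdb.connect()
# The httpfs extension lets DuckDB read remote (including hf://) Parquet files.
con.execute("INSTALL httpfs; LOAD httpfs;")

# Column names below are hypothetical stand-ins for the taxonomy labels.
medical = con.execute(
    """
    SELECT text
    FROM read_parquet('hf://datasets/EssentialAI/essential-web-v1.0/**/*.parquet')
    WHERE topic = 'Medical'
      AND quality = 'high'
    LIMIT 1000
    """
).fetch_df()

print(len(medical), "documents retrieved")
```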
👥 Authors

Andrew Hojel (Essential AI, San Francisco, CA)
Michael Pust (Essential AI)
T. Romanski (Essential AI, San Francisco, CA)
Yash Vanjani (Essential AI, San Francisco, CA)
Ritvik Kapila (ML Research Scientist, Essential AI): LLM Pre-training, Deep Learning, Privacy Preserving ML
Mohit Parmar (Essential AI, San Francisco, CA)
Adarsh Chaluvaraju (Essential AI, San Francisco, CA)
Alok Tripathy (Essential AI, San Francisco, CA)
Anil Thomas (Luminide, Inc.)
A. Tanwer (Essential AI, San Francisco, CA)
Darsh J Shah (Massachusetts Institute of Technology): Natural Language Processing, Machine Learning
Ishaan Shah (Research Engineer, GraphDeco, INRIA Sophia-Antipolis): Computer Graphics, Rendering, Light Transport Simulation
Karl Stratos (Apple AI/ML): Natural Language Processing, Deep Learning
Khoi Nguyen (Essential AI, San Francisco, CA)
Kurt Smith (Essential AI, San Francisco, CA)
Michael Callahan (Essential AI, San Francisco, CA)
Peter Rushton (Essential AI, San Francisco, CA)
Philip Monk (Essential AI, San Francisco, CA)
Platon Mazarakis (Essential AI, San Francisco, CA)
Saad Jamal (Essential AI, San Francisco, CA)
Saurabh Srivastava (Essential AI, San Francisco, CA)
Somanshu Singla (Research @ Essential AI): Multimodal Learning, LLM Reasoning, LLM Alignment
Ashish Vaswani (Essential AI, San Francisco, CA)