The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

πŸ“… 2025-06-05
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the intellectual property and ethical risks of training large language models (LLMs) on unlicensed text, this work introduces and open-sources the Common Pile v0.1, an 8TB corpus of public domain and openly licensed text drawn from 30 sources spanning academic papers, source code, books, encyclopedias, and more. The curation methodology combines collection from openly licensed sources, license identification with human verification, deduplication, and quality filtering. The authors also release the curation code, the training mixture, and pre-trained model checkpoints. Trained on this data, the 7B-parameter Comma v0.1-1T and Comma v0.1-2T models (1 and 2 trillion training tokens, respectively) perform on par with Llama 1 and 2 7B under similar compute budgets across mainstream benchmarks. The work provides a legally sound, controllable, and reproducible foundation for responsible LLM pretraining.
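As a rough illustration of the kind of curation steps the summary mentions (not the paper's actual pipeline, which is released separately as code), the sketch below shows document-level exact deduplication via content hashing plus a naive length-based quality filter; the record fields and thresholds are assumptions for the example.

```python
import hashlib

def dedup_and_filter(docs, min_words=50):
    """Illustrative sketch: exact deduplication by content hash plus a
    naive length-based quality filter. The record fields ("text",
    "license") and the threshold are hypothetical, not the Common Pile
    implementation."""
    seen = set()
    for doc in docs:
        text = doc["text"].strip()
        # Drop documents that are too short to be useful training data.
        if len(text.split()) < min_words:
            continue
        # Hash whitespace-normalized text to detect exact duplicates across sources.
        digest = hashlib.sha256(" ".join(text.split()).encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield doc

# Toy usage with records carrying an explicit license tag.
corpus = [
    {"text": "An openly licensed document about language models.", "license": "CC-BY-4.0"},
    {"text": "An openly licensed document about language models.", "license": "CC-BY-4.0"},
]
print(len(list(dedup_and_filter(corpus, min_words=3))))  # -> 1
```

In practice, large-scale curation efforts typically add fuzzy (near-duplicate) deduplication and model-based quality scoring on top of simple rules like these.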

πŸ“ Abstract
Large language models (LLMs) are typically trained on enormous quantities of unlicensed text, a practice that has led to scrutiny due to possible intellectual property infringement and ethical concerns. Training LLMs on openly licensed text presents a first step towards addressing these issues, but prior data collection efforts have yielded datasets too small or low-quality to produce performant LLMs. To address this gap, we collect, curate, and release the Common Pile v0.1, an eight terabyte collection of openly licensed text designed for LLM pretraining. The Common Pile comprises content from 30 sources that span diverse domains including research papers, code, books, encyclopedias, educational materials, audio transcripts, and more. Crucially, we validate our efforts by training two 7 billion parameter LLMs on text from the Common Pile: Comma v0.1-1T and Comma v0.1-2T, trained on 1 and 2 trillion tokens respectively. Both models attain competitive performance to LLMs trained on unlicensed text with similar computational budgets, such as Llama 1 and 2 7B. In addition to releasing the Common Pile v0.1 itself, we also release the code used in its creation as well as the training mixture and checkpoints for the Comma v0.1 models.
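To put "similar computational budgets" in rough numbers, the snippet below applies the standard 6ND rule of thumb for training FLOPs (an outside approximation, not a figure from the paper). Llama 1 7B and Llama 2 7B were trained on roughly 1T and 2T tokens respectively, which is why the abstract pairs them with Comma v0.1-1T and v0.1-2T.

```python
# Rough compute estimate using the common 6 * N * D approximation for
# training FLOPs (a rule of thumb, not a number reported in the paper).
params = 7e9                      # Comma v0.1 model size: ~7B parameters
tokens_1t, tokens_2t = 1e12, 2e12  # training tokens for the 1T and 2T variants

flops_1t = 6 * params * tokens_1t  # ~4.2e22 FLOPs for Comma v0.1-1T
flops_2t = 6 * params * tokens_2t  # ~8.4e22 FLOPs for Comma v0.1-2T
print(f"{flops_1t:.1e}, {flops_2t:.1e}")  # 4.2e+22, 8.4e+22
```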
Problem

Research questions and friction points this paper is trying to address.

Addresses ethical concerns in LLM training by using openly licensed text
Solves the lack of large, high-quality openly licensed text datasets
Demonstrates competitive LLM performance with ethically sourced training data
Innovation

Methods, ideas, or system contributions that make the work stand out.

An 8TB dataset of public domain and openly licensed text
Curation of diverse content from 30 sources
Competitive 7B-parameter LLMs (Comma v0.1) trained entirely on the corpus
πŸ”Ž Similar Papers
No similar papers found.
Nikhil Kandpal
Computer Science Ph.D. Candidate, University of Toronto
Machine Learning, Privacy
Brian Lester
University of Toronto, Vector Institute
Colin Raffel
University of Toronto, Vector Institute and Hugging Face
Machine Learning
Sebastian Majstorovic
EleutherAI
Stella Biderman
EleutherAI
Natural Language Processing, Artificial Intelligence, Language Modeling, Deep Learning
Baber Abbasi
EleutherAI
Luca Soldaini
Allen Institute for AI
Large Language Models, Open Source AI, Information Retrieval
Enrico Shippole
Teraflop AI
A. Feder Cooper
Stanford, Microsoft Research
machine learning, tech policy
Aviya Skowron
EleutherAI
John Kirchenbauer
University of Maryland, College Park
Machine Learning, Natural Language Processing, Large Language Models, ML Security
Shayne Longpre
MIT, Stanford, Apple
Deep Learning, Natural Language Understanding
Lintang Sutawika
Language Technologies Institute at Carnegie Mellon University
Natural Language Processing, Language Modeling
Alon Albalak
Lila Sciences
Data-Centric AI, Machine Learning, Open-Endedness
Zhenlin Xu
Boson AI
representation learning, multimodal, LLM
Guilherme Penedo
ML Research Engineer at 🤗 Hugging Face
Loubna Ben Allal
Hugging Face
Elie Bakouch
Research Engineer at Hugging Face
machine learning
John David Pressman
EleutherAI
Honglu Fan
EleutherAI, poolside
Dashiell Stander
EleutherAI
Guangyu Song
EleutherAI
Aaron Gokaslan
Cornell University
computer vision, graphics, deep learning, robotics
Tom Goldstein
Volpi-Cupal Professor of Computer Science, University of Maryland
Numerical Optimization, Machine Learning, Distributed Computing, Computer Vision
Brian R. Bartoldson
Lawrence Livermore National Laboratory
Bhavya Kailkhura
Research Scientist, Lawrence Livermore National Laboratory
AI Security & Alignment, Compressed & Fast AI
Tyler Murray
The Allen Institute for Artificial Intelligence