🤖 AI Summary
Current LLM pretraining relies heavily on trillion-token corpora containing substantial copyrighted or proprietary content, which poses significant challenges for AI data governance and regulatory compliance. Method: The authors construct the largest open, copyright-compliant pretraining corpus to date (approximately 2 trillion tokens), drawn exclusively from public-domain or explicitly licensed sources and spanning multilingual text (including low-resource languages) and large-scale code. They introduce a four-dimensional cleaning framework: copyright metadata identification, multi-level license validation, cross-lingual quality filtering, and time- and domain-balanced sampling. Contribution/Results: This is the first systematic, end-to-end traceable construction of a large-scale, fully copyright-compliant, and highly diverse open corpus. Already used by organizations including Anthropic, it underpins multiple open LLM training efforts and is positioned to become a foundational resource for compliant pretraining in open science.
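To make the license-validation step of such a cleaning framework concrete, here is a minimal sketch of one check such a stage might perform: keeping only records whose license metadata matches an allowlist of public-domain or permissive licenses. This is an illustration, not the paper's actual pipeline; the license identifiers and the record fields (`text`, `license`) are assumptions made for the example.

```python
# Illustrative license filter (NOT the actual Common Corpus pipeline).
# Allowlist of public-domain / permissive license identifiers (assumed names).
PERMISSIVE_LICENSES = {
    "public-domain", "cc0-1.0", "cc-by-4.0", "cc-by-sa-4.0",
    "mit", "apache-2.0",
}

def is_compliant(record: dict) -> bool:
    """Keep a record only if its license metadata is present and permissive."""
    license_id = (record.get("license") or "").strip().lower()
    return license_id in PERMISSIVE_LICENSES

def filter_corpus(records: list[dict]) -> list[dict]:
    """Return only the records that pass the license check."""
    return [r for r in records if is_compliant(r)]

if __name__ == "__main__":
    sample = [
        {"text": "A public-domain book.", "license": "Public-Domain"},
        {"text": "A proprietary article.", "license": "all-rights-reserved"},
        {"text": "No license metadata.", "license": None},
    ]
    # Only the first record survives the filter.
    print([r["text"] for r in filter_corpus(sample)])
```

A real pipeline would of course go further than string matching, e.g. validating licenses at multiple levels (collection, document) and verifying provenance, as the summary's four-dimensional framework describes.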
📝 Abstract
Large Language Models (LLMs) are pre-trained on large amounts of data from diverse sources and domains. These datasets often contain trillions of tokens, with large portions of copyrighted or proprietary content, which hinders the use of such models under AI legislation. This raises the need for truly open pre-training data that complies with data security regulations. In this paper, we introduce Common Corpus, the largest open dataset for language model pre-training. The data assembled in Common Corpus are either uncopyrighted or under permissive licenses and amount to about two trillion tokens. The dataset covers a wide variety of languages, ranging from the main European languages to low-resource languages rarely present in pre-training datasets, and it also includes a large portion of code data. The diversity of data sources in terms of covered domains and time periods opens paths for both research and commercial applications across diverse areas of knowledge. In this technical report, we present the detailed provenance of the data assembly as well as the details of dataset filtering and curation. Given that Common Corpus is already used by industry leaders such as Anthropic and by multiple LLM training projects, we believe it will become critical infrastructure for open science research in LLMs.