🤖 AI Summary
Current LLM pretraining relies heavily on trillion-token corpora containing substantial copyrighted or proprietary content, which poses significant challenges for AI data governance and regulatory compliance. Method: The authors construct the largest open, copyright-compliant pretraining corpus to date (approximately 2 trillion tokens), drawn exclusively from public-domain or explicitly licensed sources and spanning multilingual text (including low-resource languages) and large-scale code. They introduce a four-dimensional cleaning framework: copyright metadata identification, multi-level license validation, cross-lingual quality filtering, and time- and domain-balanced sampling. Contribution/Results: This is the first systematic, end-to-end traceable construction of a large-scale, fully copyright-compliant, and highly diverse open corpus. Already used by organizations including Anthropic, it underpins multiple open LLM training efforts and is positioned to become a foundational resource for compliant pretraining in open science.
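To make the license-validation step of such a cleaning framework concrete, here is a minimal sketch of one check such a stage might perform: keeping only records whose license metadata matches an allowlist of public-domain or permissive licenses. This is an illustration, not the paper's actual pipeline; the license identifiers and the record fields (`text`, `license`) are assumptions made for the example.

```python
# Illustrative license filter (NOT the actual Common Corpus pipeline).
# Allowlist of public-domain / permissive license identifiers (assumed names).
PERMISSIVE_LICENSES = {
    "public-domain", "cc0-1.0", "cc-by-4.0", "cc-by-sa-4.0",
    "mit", "apache-2.0",
}

def is_compliant(record: dict) -> bool:
    """Keep a record only if its license metadata is present and permissive."""
    license_id = (record.get("license") or "").strip().lower()
    return license_id in PERMISSIVE_LICENSES

def filter_corpus(records: list[dict]) -> list[dict]:
    """Return only the records that pass the license check."""
    return [r for r in records if is_compliant(r)]

if __name__ == "__main__":
    sample = [
        {"text": "A public-domain book.", "license": "Public-Domain"},
        {"text": "A proprietary article.", "license": "all-rights-reserved"},
        {"text": "No license metadata.", "license": None},
    ]
    # Only the first record survives the filter.
    print([r["text"] for r in filter_corpus(sample)])
```

A real pipeline would of course go further than string matching, e.g. validating licenses at multiple levels (collection, document) and verifying provenance, as the summary's four-dimensional framework describes.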
📝 Abstract
Large Language Models (LLMs) are pre-trained on large amounts of data from diverse sources and domains. These datasets often contain trillions of tokens, with large portions of copyrighted or proprietary content, which hinders the use of such models under AI legislation. This raises the need for truly open pre-training data that complies with data security regulations. In this paper, we introduce Common Corpus, the largest open dataset for language model pre-training. The data assembled in Common Corpus are either uncopyrighted or under permissive licenses and amount to about two trillion tokens. The dataset covers a wide variety of languages, ranging from the main European languages to low-resource languages rarely present in pre-training datasets, and it also includes a large portion of code data. The diversity of data sources in terms of covered domains and time periods opens paths for both research and commercial applications across diverse areas of knowledge. In this technical report, we present the detailed provenance of the data assembly as well as the details of dataset filtering and curation. Given that Common Corpus is already used by industry leaders such as Anthropic and by multiple LLM training projects, we believe it will become critical infrastructure for open science research in LLMs.