The German Commons - 154 Billion Tokens of Openly Licensed Text for German Language Models

📅 2025-10-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the critical shortage of high-quality, openly licensed German-language training data, which severely constrains the development of open large language models, this paper introduces the largest publicly available German text corpus, with all subsets licensed under CC-BY-SA 4.0 or equivalent. The corpus spans seven domains, including law, science, and culture, and comprises 154.56 billion high-quality tokens. Methodologically, the authors combine systematic multi-source collection, German-specific quality filtering, deduplication, and text repair, and release their domain-optimized preprocessing pipeline as open source. Crucially, the end-to-end construction process is fully reproducible and extensible. This work fills a fundamental gap in openly licensed, non-English training data for large language models and establishes a legally compliant, scalable foundation for training, distributing, and iteratively improving open German-language models.

📝 Abstract
Large language model development relies on large-scale training corpora, yet most contain data of unclear licensing status, limiting the development of truly open models. This problem is exacerbated for non-English languages, where openly licensed text remains critically scarce. We introduce the German Commons, the largest collection of openly licensed German text to date. It compiles data from 41 sources across seven domains, encompassing legal, scientific, cultural, political, news, economic, and web text. Through systematic sourcing from established data providers with verifiable licensing, it yields 154.56 billion tokens of high-quality text for language model training. Our processing pipeline implements comprehensive quality filtering, deduplication, and text formatting fixes, ensuring consistent quality across heterogeneous text sources. All domain subsets feature licenses of at least CC-BY-SA 4.0 or equivalent, ensuring legal compliance for model training and redistribution. The German Commons therefore addresses the critical gap in openly licensed German pretraining data, and enables the development of truly open German language models. We also release code for corpus construction and data filtering tailored to German language text, rendering the German Commons fully reproducible and extensible.
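The deduplication step mentioned in the abstract could, at its simplest, be an exact-match pass over normalized documents. The sketch below is illustrative only; the normalization and hashing choices are assumptions, not the paper's actual pipeline, which likely also performs fuzzy (near-duplicate) matching.

```python
import hashlib

def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivially different copies hash alike."""
    return " ".join(text.lower().split())

def exact_dedup(docs):
    """Keep the first occurrence of each normalized document."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

corpus = [
    "Das ist ein Beispieltext.",
    "Das ist  ein  Beispieltext.",  # duplicate up to whitespace
    "Ein anderer Satz.",
]
print(len(exact_dedup(corpus)))  # → 2
```

In practice a corpus of this scale would shard the hash set across workers; the single-set version above only shows the core idea.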
Problem

Research questions and friction points this paper is trying to address.

Addressing scarcity of openly licensed German text data
Providing 154 billion tokens for German language model training
Ensuring legal compliance and quality across diverse domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Largest openly licensed German text collection
Systematic sourcing from 41 verifiable license sources
Comprehensive quality filtering and deduplication pipeline
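One common ingredient of language-specific quality filtering, of the kind the pipeline above performs, is a stopword-ratio heuristic: documents with too few common German function words are likely boilerplate or non-German text. The thresholds, word list, and function names below are hypothetical illustrations, not the paper's released filter.

```python
# Tiny illustrative stopword list; a real filter would use a much larger one.
GERMAN_STOPWORDS = {"der", "die", "das", "und", "ist", "ein", "eine",
                    "nicht", "mit", "auf", "in", "zu", "den", "von"}

def german_stopword_ratio(text: str) -> float:
    """Fraction of tokens that are common German function words."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in GERMAN_STOPWORDS for w in words) / len(words)

def passes_filter(text: str, min_ratio: float = 0.1, min_words: int = 5) -> bool:
    """Accept documents that are long enough and plausibly German."""
    words = text.split()
    return len(words) >= min_words and german_stopword_ratio(text) >= min_ratio

print(passes_filter("Das ist ein guter deutscher Satz und er ist lang genug."))
print(passes_filter("foo bar baz qux quux corge"))
```

Real pipelines layer many such signals (length, character distributions, language ID, perplexity); this shows only one.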