Repetition over Diversity: High-Signal Data Filtering for Sample-Efficient German Language Modeling

📅 2026-04-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the trade-off between data diversity and repeated training on high-quality data when building large language models for high-resource non-English languages, focusing on German. The authors construct a hierarchical quality filter, apply it to 500 million German web documents, and compare, under a fixed token budget, single-pass training on a large, lightly filtered corpus against multi-epoch training on smaller, strictly filtered subsets. They report, for the first time in a non-English setting, that repeated high-quality data consistently outperforms diverse but noisy data, and that the gap persists even after seven epochs. Building on this result, they train Boldt, a family of German language models reported to achieve state-of-the-art results with 10 to 360 times fewer training tokens than comparable models, and release cleaned evaluation benchmarks to support future research.
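The fixed-token-budget comparison comes down to simple arithmetic: with the same number of training tokens, a smaller filtered corpus has to be repeated more often. The sketch below illustrates that relationship. The function name `epochs_for_budget` and all token counts are hypothetical and chosen only for illustration; they are not figures from the paper.

```python
def epochs_for_budget(token_budget: int, unique_tokens: int) -> float:
    """Return how many passes over a corpus are needed to consume a fixed token budget."""
    if unique_tokens <= 0:
        raise ValueError("unique_tokens must be positive")
    return token_budget / unique_tokens


# Hypothetical numbers (assumptions for illustration, not taken from the paper).
budget = 70_000_000_000          # total training tokens available
diverse_corpus = 70_000_000_000  # lightly filtered corpus, seen once
filtered_core = 10_000_000_000   # strictly filtered subset, repeated

print(epochs_for_budget(budget, diverse_corpus))  # 1.0 (single pass)
print(epochs_for_budget(budget, filtered_core))   # 7.0 (seven epochs)
```

Under these assumed sizes, both training runs see the same number of tokens; only the number of unique tokens differs.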

📝 Abstract

Recent research has shown that filtering massive English web corpora into high-quality subsets significantly improves training efficiency. However, for high-resource non-English languages like German, French, or Japanese, aggressive filtering creates a strategic dilemma: should practitioners prioritize diversity by training once on large amounts of lightly filtered web data, or prioritize quality by strictly filtering for a high-quality core and repeating it over multiple epochs? We investigate this trade-off for German by constructing hierarchical quality filters applied to 500M web documents, comparing multi-epoch training on the filtered subsets against single-pass training on a diverse corpus. Our experiments across multiple model scales and token budgets show that repeating high-quality data consistently outperforms single-pass training on larger, less filtered sets. Notably, the performance gap persists even after 7 epochs. Our findings suggest that for non-English LLMs, semantic concentration through quality filtering offers a more viable path to efficient language modeling than simply maximizing unique data volume. We release our German language models (called Boldt), as well as our cleaned evaluation benchmarks to the research community. Our experiments indicate that they achieve state-of-the-art results despite training on 10-360x fewer tokens than comparable models.
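The abstract does not say how the hierarchical quality filters are built. One plausible reading, shown here purely as an assumption, is a set of increasingly strict thresholds on a per-document quality score, which yields nested subsets of decreasing size. The `Doc` class, the `quality` score, the `tiered_subsets` helper, and the threshold values are all hypothetical and do not describe the authors' pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    text: str
    quality: float  # hypothetical score in [0, 1] from some quality classifier


def tiered_subsets(docs, thresholds):
    """Return one subset per threshold; stricter thresholds give smaller, nested subsets."""
    return {t: [d for d in docs if d.quality >= t] for t in sorted(thresholds)}


# Toy documents with made-up scores (illustration only).
docs = [Doc("a", 0.2), Doc("b", 0.55), Doc("c", 0.8), Doc("d", 0.95)]
tiers = tiered_subsets(docs, [0.0, 0.5, 0.9])

for threshold, subset in tiers.items():
    print(threshold, len(subset))
```

In this sketch the loosest tier corresponds to the diverse, lightly filtered corpus, and the strictest tier corresponds to the small high-quality core that would be repeated over several epochs.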

Problem

Research questions and friction points this paper is trying to address.

data filtering

language modeling

training efficiency

data quality vs diversity

non-English LLMs

Innovation

Methods, ideas, or system contributions that make the work stand out.

high-signal data filtering

sample-efficient language modeling

quality over diversity