🤖 AI Summary
This study addresses the limited quality of available German pretraining data, which constrains the performance and training efficiency of German large language models (LLMs). To this end, the authors construct Aleph-Alpha-GermanWeb, a high-quality German pretraining dataset that draws on filtered Common Crawl web data, FineWeb2, and synthetic data generated conditioned on real web pages. The curation pipeline combines heuristic filtering with model-based quality scoring and conditional synthetic data generation. The dataset is evaluated by pretraining a 1B-parameter Llama-style model and an 8B-parameter tokenizer-free hierarchical autoregressive transformer (HAT). On German-language benchmarks, including MMMLU, models trained on Aleph-Alpha-GermanWeb significantly outperform those trained on FineWeb2 alone; at the 8B scale, this advantage persists even when FineWeb2 is enriched with human-curated high-quality sources such as Wikipedia. The results indicate that model-based curation and synthetic data generation can improve both training efficiency and final model performance.
📝 Abstract
Scaling data quantity is essential for large language models (LLMs), yet recent findings show that data quality can significantly boost performance and training efficiency. We introduce a German-language dataset curation pipeline that combines heuristic and model-based filtering techniques with synthetic data generation. We use our pipeline to create Aleph-Alpha-GermanWeb, a large-scale German pre-training dataset which draws from: (1) Common Crawl web data, (2) FineWeb2, and (3) synthetically-generated data conditioned on actual, organic web data. We evaluate our dataset by pre-training both a 1B Llama-style model and an 8B tokenizer-free hierarchical autoregressive transformer (HAT). A comparison on German-language benchmarks, including MMMLU, shows significant performance gains of Aleph-Alpha-GermanWeb over FineWeb2 alone. This advantage holds at the 8B scale even when FineWeb2 is enriched by human-curated high-quality data sources such as Wikipedia. Our findings support the growing body of evidence that model-based data curation and synthetic data generation can significantly enhance LLM pre-training datasets.
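To make the two-stage curation idea concrete, below is a minimal sketch of the "heuristic filter first, model-based score second" pattern the pipeline describes. Everything here is illustrative: the thresholds (`MIN_WORDS`, `MAX_SYMBOL_RATIO`, `score_threshold`) and the toy quality scorer are assumptions, not the paper's actual rules; in the real pipeline the scorer would be a learned quality classifier rather than a hand-written proxy.

```python
import re

# Illustrative thresholds only -- the paper's actual filtering rules
# and cutoffs are not reproduced here.
MIN_WORDS = 50          # drop very short pages (assumed threshold)
MAX_SYMBOL_RATIO = 0.3  # drop pages dominated by non-letter characters

def passes_heuristics(text: str) -> bool:
    """Cheap rule-based pre-filter, applied before any model scoring."""
    words = text.split()
    if len(words) < MIN_WORDS:
        return False
    letters = sum(ch.isalpha() for ch in text)
    # Require that at least 70% of characters are letters.
    return letters / max(len(text), 1) >= 1 - MAX_SYMBOL_RATIO

def model_quality_score(text: str) -> float:
    """Stand-in for a learned quality classifier. This toy proxy
    rewards longer average sentence length -- purely illustrative."""
    sentences = [s for s in re.split(r"[.!?]", text) if s.strip()]
    if not sentences:
        return 0.0
    avg_len = sum(len(s.split()) for s in sentences) / len(sentences)
    return min(avg_len / 20.0, 1.0)  # clamp to [0, 1]

def curate(docs, score_threshold=0.5):
    """Two-stage filter: fast heuristics first, model scoring second,
    so the expensive scorer only sees documents that survive the rules."""
    return [
        doc for doc in docs
        if passes_heuristics(doc)
        and model_quality_score(doc) >= score_threshold
    ]
```

Running heuristics before the model-based scorer mirrors the usual cost structure of such pipelines: rule checks are nearly free, while a quality model is comparatively expensive, so the cheap stage prunes the bulk of low-quality pages first.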