🤖 AI Summary
This work addresses German natural language understanding (NLU), text embedding, and long-context reasoning under resource constraints. We propose ModernGBERT, a natively German encoder trained from scratch at two scales (134M and 1B parameters), and systematically compare it against encoder variants derived from the German decoder LLM LLäMmlein via LLM2Vec. To our knowledge, this is the first full-scale pretraining of a monolingual German encoder incorporating ModernBERT's architectural enhancements. We use a controlled multi-task evaluation setup for German, including GLUE-de, STS-de, and LongBEIR-de, to uniformly assess both native and converted encoders for performance and parameter efficiency. Experiments demonstrate that ModernGBERT-1B surpasses prior state-of-the-art German encoders and all LLäMmlein2Vec variants on multiple NLU and embedding tasks, achieving higher accuracy with superior parameter efficiency. All models, datasets, and code are publicly released.
📝 Abstract
Despite the prominence of decoder-only language models, encoders remain crucial for resource-constrained applications. We introduce ModernGBERT (134M, 1B), a fully transparent family of German encoder models trained from scratch, incorporating architectural innovations from ModernBERT. To evaluate the practical trade-offs of training encoders from scratch, we also present LLäMmlein2Vec (120M, 1B, 7B), a family of encoders derived from German decoder-only models via LLM2Vec. We benchmark all models on natural language understanding, text embedding, and long-context reasoning tasks, enabling a controlled comparison between dedicated encoders and converted decoders. Our results show that ModernGBERT 1B outperforms prior state-of-the-art German encoders as well as encoders adapted via LLM2Vec in both performance and parameter efficiency. All models, training data, checkpoints, and code are publicly available, advancing the German NLP ecosystem with transparent, high-performance encoder models.
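The decoder-to-encoder conversion mentioned above hinges on one structural change: LLM2Vec drops the decoder's causal attention mask so every token can attend to the whole sequence (the method then continues with additional adaptation training, which is omitted here). A minimal NumPy sketch, not the paper's or the LLM2Vec library's code, of that mask change in a single attention layer:

```python
import numpy as np

def attention_weights(q, k, causal):
    """Toy single-head attention; returns the softmax weight matrix.

    With causal=True (decoder-style), each position may only attend to
    itself and earlier positions; with causal=False (encoder-style, as
    after LLM2Vec conversion), all positions attend everywhere.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    if causal:
        n = scores.shape[0]
        # Mask out future positions (strict upper triangle).
        scores = np.where(np.triu(np.ones((n, n), dtype=bool), k=1), -np.inf, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((4, 8))

w_causal = attention_weights(q, k, causal=True)   # upper triangle is exactly 0
w_bidir = attention_weights(q, k, causal=False)   # every entry is positive
```

In the toy example, `w_causal` has zero weight on all future tokens, while `w_bidir` distributes weight over the full sequence, which is what makes the converted model usable as a text encoder.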