Llama-GENBA-10B: A Trilingual Large Language Model for German, English and Bavarian

📅 2025-09-06
🤖 AI Summary
Large language models (LLMs) exhibit a strong English-centric bias and offer inadequate support for low-resource languages such as Bavarian, hindering equitable multilingual AI development. Method: We introduce the first unified trilingual LLM for German, English, and Bavarian. The approach comprises: (i) constructing the first shared tokenizer across all three languages; (ii) designing a standardized cross-lingual evaluation framework that systematically transfers German benchmarks to Bavarian; and (iii) extending Llama 3.1-8B to 10B parameters and continually pretraining it on 164B multilingual tokens using the Cerebras CS-2 accelerator. Contribution/Results: The key innovation is the joint cross-lingual optimization of data mixing ratios and architectural hyperparameters. Empirically, the model achieves state-of-the-art performance on Bavarian tasks, outperforming Apertus-8B-2509 and gemma-2-9b, and surpasses EuroLLM in English while matching its German performance. This significantly mitigates English centrism and advances dialectal language technology.

📝 Abstract
We present Llama-GENBA-10B, a trilingual foundation model addressing English-centric bias in large language models. Built on Llama 3.1-8B and scaled to 10B parameters, Llama-GENBA-10B is continuously pretrained on 164B tokens (82B English, 82B German, and 80M Bavarian), balancing resources while preventing English dominance. Targeted at the German NLP community, the model also promotes Bavarian as a low-resource language. Development tackled four challenges: (1) curating a multilingual corpus despite Bavarian scarcity, (2) creating a unified tokenizer for English, German, and Bavarian, (3) optimizing architecture and language-ratio hyperparameters for cross-lingual transfer, and (4) establishing the first standardized trilingual evaluation suite by translating German benchmarks into Bavarian. Evaluations show that Llama-GENBA-10B achieves strong cross-lingual performance, with the fine-tuned variant surpassing Apertus-8B-2509 and gemma-2-9b in Bavarian and establishing itself as the best model in its class for this language, while also outperforming EuroLLM in English and matching its results in German. Training on the Cerebras CS-2 demonstrated efficient large-scale multilingual pretraining with documented energy use, offering a blueprint for inclusive foundation models that integrate low-resource languages.
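The abstract quotes a near-balanced corpus of 82B English, 82B German, and 80M Bavarian tokens. A quick back-of-envelope check of these shares, using only the figures stated above (the variable names are illustrative, not from the paper):

```python
# Illustrative check of the corpus composition stated in the abstract:
# 82B English + 82B German + 80M Bavarian ≈ 164B tokens total.
# Not code from the paper; a sketch using the published token counts.

corpus_tokens = {
    "English": 82_000_000_000,
    "German": 82_000_000_000,
    "Bavarian": 80_000_000,
}

# Total corpus size and per-language fractions of the training mix.
total = sum(corpus_tokens.values())
shares = {lang: n / total for lang, n in corpus_tokens.items()}

print(f"total tokens: {total / 1e9:.2f}B")
for lang, share in shares.items():
    print(f"{lang}: {share:.4%}")
```

English and German each account for just under 50% of the mix, while Bavarian contributes well under 0.1%, which underlines why the paper treats data-ratio tuning and cross-lingual transfer from German as central to Bavarian performance.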
Problem

Research questions and friction points this paper is trying to address.

Addressing English-centric bias in large language models
Promoting Bavarian as a low-resource language
Creating a unified multilingual model for three languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Trilingual model continually pretrained on a balanced corpus
Unified tokenizer for English, German, and Bavarian
First standardized trilingual evaluation suite with translated benchmarks
Michael Hoffmann, ETH Zürich (Geometry, Graphs, Algorithms, Complexity)
Jophin John, Leibniz Supercomputing Centre (LRZ)
Stefan Schweter, Independent Researcher
Gokul Ramakrishnan, Cerebras Systems
Hoi-Fong Mak, Leibniz Supercomputing Centre (LRZ)
Alice Zhang, University of Texas at Austin (Wearable computing, Audio and speech processing)
Dmitry Gaynullin, Cerebras Systems
Nicolay J. Hammer, Leibniz Supercomputing Centre (LRZ)