🤖 AI Summary
Large language models (LLMs) suffer from severe English-centric bias and inadequate support for low-resource languages—e.g., Bavarian—hindering equitable multilingual AI development.
Method: We introduce the first unified trilingual LLM for German, English, and Bavarian. Our approach comprises: (i) constructing the first tokenizer shared across all three languages; (ii) designing a standardized cross-lingual evaluation framework that systematically transfers German benchmarks to Bavarian; and (iii) extending Llama 3.1-8B to 10B parameters and continually pretraining it on 164B multilingual tokens using the Cerebras CS-2 accelerator.
Contribution/Results: The key innovation is the joint optimization of data-mixing ratios and architectural hyperparameters for cross-lingual transfer. Empirically, the fine-tuned model outperforms Apertus-8B-2509 and gemma-2-9b on Bavarian tasks, making it the best model in its class for that language; it also surpasses EuroLLM in English while matching its German performance. These results reduce English-centric bias and advance language technology for dialects.
📝 Abstract
We present Llama-GENBA-10B, a trilingual foundation model addressing English-centric bias in large language models. Built on Llama 3.1-8B and scaled to 10B parameters, Llama-GENBA-10B is continually pretrained on 164B tokens (82B English, 82B German, and 80M Bavarian), balancing resources while preventing English dominance. Targeted at the German NLP community, the model also promotes Bavarian as a low-resource language. Development tackled four challenges: (1) curating a multilingual corpus despite Bavarian scarcity, (2) creating a unified tokenizer for English, German, and Bavarian, (3) optimizing architecture and language-ratio hyperparameters for cross-lingual transfer, and (4) establishing the first standardized trilingual evaluation suite by translating German benchmarks into Bavarian. Evaluations show that Llama-GENBA-10B achieves strong cross-lingual performance: the fine-tuned variant surpasses Apertus-8B-2509 and gemma-2-9b in Bavarian, establishing itself as the best model in its class for this language, while also outperforming EuroLLM in English and matching its results in German. Training on the Cerebras CS-2 demonstrated efficient large-scale multilingual pretraining with documented energy use, offering a blueprint for inclusive foundation models that integrate low-resource languages.
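To make the stated data mix concrete, the short sketch below computes each language's share of the pretraining tokens from the counts given in the abstract. It is only illustrative arithmetic: the names `TOKENS` and `mixture_shares` are hypothetical, and the sketch assumes the counts are exact and says nothing about how the authors sampled or weighted the data during training.

```python
# Token counts as stated in the abstract (assumed exact for this illustration).
TOKENS = {
    "English": 82_000_000_000,
    "German": 82_000_000_000,
    "Bavarian": 80_000_000,
}


def mixture_shares(tokens):
    """Return each language's fraction of the total token count."""
    total = sum(tokens.values())
    return {lang: count / total for lang, count in tokens.items()}


total_tokens = sum(TOKENS.values())
shares = mixture_shares(TOKENS)

print(f"total tokens: {total_tokens:,}")
for lang, share in shares.items():
    # Print as a percentage with enough digits to see the Bavarian share.
    print(f"{lang}: {share:.4%}")
```

Under these numbers English and German contribute equally, just under half of the corpus each, while Bavarian accounts for roughly 0.05% of the tokens. The exact sum is 164.08B, which the abstract rounds to 164B.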