🤖 AI Summary
Mainstream large language models (LLMs) exhibit English-centric biases and provide inadequate support for the EU's 24 official languages. Method: We introduce an open-source 7B LLM, released in two versions, designed for pan-European linguistic coverage. It is pretrained on multilingual corpora that are roughly 60% non-English, employs a cross-lingually optimized tokenizer, adopts balanced language sampling and mixed-language training, and undergoes supervised fine-tuning and instruction alignment to strengthen multilingual competence. Contribution/Results: This work delivers the first open-source LLM to natively support all 24 official EU languages, achieves substantial gains on low-resource languages (+23.5% average improvement), and introduces EU-localized evaluation benchmarks (EU-ARC and EU-HellaSwag). In comprehensive multilingual evaluation, the model matches the performance of Llama-3-8B, challenging the English-centric paradigm in LLM development.
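The summary mentions balanced language sampling across the 24 languages but does not spell out the scheme. As a minimal sketch of one standard approach (temperature-based sampling, not confirmed as the authors' exact method), the snippet below upweights low-resource languages relative to the raw corpus distribution; the `corpus_sizes` figures and the temperature value are illustrative assumptions only.

```python
# Sketch: temperature-based language sampling for multilingual pretraining.
# Assumption: per-language corpus sizes below are invented for illustration.
corpus_sizes = {"en": 1_000_000, "de": 400_000, "fr": 350_000, "mt": 5_000, "ga": 8_000}


def sampling_probs(sizes: dict[str, int], temperature: float = 0.3) -> dict[str, float]:
    """Return per-language sampling probabilities with p_i proportional to n_i**temperature.

    temperature=1.0 reproduces the raw corpus distribution;
    temperature -> 0 approaches uniform sampling, which upweights
    low-resource languages such as Maltese (mt) or Irish (ga).
    """
    weights = {lang: n ** temperature for lang, n in sizes.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}


if __name__ == "__main__":
    for lang, p in sampling_probs(corpus_sizes).items():
        print(f"{lang}: {p:.3f}")
```

At temperature 0.3, Maltese rises from well under 1% of raw documents to several percent of sampled batches, which is the kind of rebalancing that makes the reported low-resource gains plausible.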
📝 Abstract
We present two multilingual LLMs designed to embrace Europe's linguistic diversity by supporting all 24 official languages of the European Union. Trained on a dataset comprising around 60% non-English data and using a custom multilingual tokenizer, our models address the limitations of existing LLMs, which predominantly focus on English or a few high-resource languages. We detail the models' development principles: data composition, tokenizer optimization, and training methodology. The models demonstrate competitive performance across multilingual benchmarks, as evidenced by results on European versions of ARC, HellaSwag, MMLU, and TruthfulQA.
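The abstract highlights a custom multilingual tokenizer as a core design choice. One common way to assess how well a tokenizer serves many languages is fertility (average tokens per word): lower fertility means more compact encoding. The sketch below computes that metric for any tokenize callable; the checkpoint name in the usage comment is a placeholder, not the paper's model.

```python
from typing import Callable


def fertility(tokenize: Callable[[str], list[str]], texts: list[str]) -> float:
    """Average number of tokens per whitespace-delimited word.

    English-centric tokenizers typically show much higher fertility
    on morphologically rich or low-resource EU languages, inflating
    sequence lengths and compute cost for those languages.
    """
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / max(n_words, 1)


# Usage sketch with a hypothetical Hugging Face checkpoint id:
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("some-multilingual-7b")  # placeholder, not the paper's model
#   print(fertility(tok.tokenize, ["Dies ist ein Beispielsatz."]))
```

Comparing this metric across the 24 EU languages, for a candidate tokenizer versus an English-centric baseline, is one way the benefit of cross-lingual tokenizer optimization can be made concrete.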