🤖 AI Summary
Mainstream large language models (LLMs) exhibit English-centric biases and provide inadequate support for the EU's 24 official languages. Method: We introduce an open-source 7B LLM, released in two versions, designed for pan-European linguistic coverage. It is pretrained on multilingual corpora that are roughly 60% non-English, employs a cross-lingually optimized tokenizer, adopts balanced language sampling and mixed-language training, and undergoes supervised fine-tuning and instruction alignment to strengthen multilingual competence. Contribution/Results: This work delivers the first open-source LLM to natively support all 24 official EU languages, achieves substantial gains on low-resource languages (+23.5% average improvement), and introduces EU-localized evaluation benchmarks (EU-ARC and EU-HellaSwag). In comprehensive multilingual evaluation, the model matches the performance of Llama-3-8B, challenging the English-centric paradigm in LLM development.
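The summary mentions balanced language sampling across the 24 languages but does not spell out the scheme. As a minimal sketch of one standard approach (temperature-based sampling, not confirmed as the authors' exact method), the snippet below upweights low-resource languages relative to the raw corpus distribution; the `corpus_sizes` figures and the temperature value are illustrative assumptions only.

```python
# Sketch: temperature-based language sampling for multilingual pretraining.
# Assumption: per-language corpus sizes below are invented for illustration.
corpus_sizes = {"en": 1_000_000, "de": 400_000, "fr": 350_000, "mt": 5_000, "ga": 8_000}


def sampling_probs(sizes: dict[str, int], temperature: float = 0.3) -> dict[str, float]:
    """Return per-language sampling probabilities with p_i proportional to n_i**temperature.

    temperature=1.0 reproduces the raw corpus distribution;
    temperature -> 0 approaches uniform sampling, which upweights
    low-resource languages such as Maltese (mt) or Irish (ga).
    """
    weights = {lang: n ** temperature for lang, n in sizes.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}


if __name__ == "__main__":
    for lang, p in sampling_probs(corpus_sizes).items():
        print(f"{lang}: {p:.3f}")
```

At temperature 0.3, Maltese rises from well under 1% of raw documents to several percent of sampled batches, which is the kind of rebalancing that makes the reported low-resource gains plausible.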
📝 Abstract
We present two multilingual LLMs designed to embrace Europe's linguistic diversity by supporting all 24 official languages of the European Union. Trained on a dataset comprising around 60% non-English data and using a custom multilingual tokenizer, our models address the limitations of existing LLMs, which predominantly focus on English or a few high-resource languages. We detail the models' development principles: data composition, tokenizer optimization, and training methodology. The models demonstrate competitive performance across multilingual benchmarks, as evidenced by results on European versions of ARC, HellaSwag, MMLU, and TruthfulQA.
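The abstract highlights a custom multilingual tokenizer as a core design choice. One common way to assess how well a tokenizer serves many languages is fertility (average tokens per word): lower fertility means more compact encoding. The sketch below computes that metric for any tokenize callable; the checkpoint name in the usage comment is a placeholder, not the paper's model.

```python
from typing import Callable


def fertility(tokenize: Callable[[str], list[str]], texts: list[str]) -> float:
    """Average number of tokens per whitespace-delimited word.

    English-centric tokenizers typically show much higher fertility
    on morphologically rich or low-resource EU languages, inflating
    sequence lengths and compute cost for those languages.
    """
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / max(n_words, 1)


# Usage sketch with a hypothetical Hugging Face checkpoint id:
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("some-multilingual-7b")  # placeholder, not the paper's model
#   print(fertility(tok.tokenize, ["Dies ist ein Beispielsatz."]))
```

Comparing this metric across the 24 EU languages, for a candidate tokenizer versus an English-centric baseline, is one way the benefit of cross-lingual tokenizer optimization can be made concrete.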