Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs

📅 2024-09-30
🏛️ arXiv.org
📈 Citations: 8
Influential: 0
📄 PDF
🤖 AI Summary
Mainstream large language models (LLMs) exhibit English-centric biases and inadequately support the EU's 24 official languages. Method: We introduce an open-source 7B LLM in two versions (base and instruct) designed for pan-European linguistic coverage. It is pretrained on a corpus that is roughly 60% non-English, employs a cross-lingually optimized tokenizer, adopts balanced language sampling and mixed-language training, and undergoes supervised fine-tuning and instruction alignment to strengthen multilingual competence. Contribution/Results: This work delivers the first open-source LLM natively supporting all 24 official EU languages; achieves substantial gains for low-resource languages (+23.5% average improvement); and introduces EU-localized evaluation benchmarks, EU-ARC and EU-HellaSwag. In comprehensive multilingual evaluation, the models match the performance of Llama-3-8B, challenging the English-centric paradigm in LLM development.

📝 Abstract
We present two multilingual LLMs designed to embrace Europe's linguistic diversity by supporting all 24 official languages of the European Union. Trained on a dataset comprising around 60% non-English data and utilizing a custom multilingual tokenizer, our models address the limitations of existing LLMs that predominantly focus on English or a few high-resource languages. We detail the models' development principles, i.e., data composition, tokenizer optimization, and training methodologies. The models demonstrate competitive performance across multilingual benchmarks, as evidenced by their performance on European versions of ARC, HellaSwag, MMLU, and TruthfulQA.
Problem

Research questions and friction points this paper is trying to address.

Addressing limitations of English-focused large language models
Supporting all 24 official European Union languages
Overcoming bias toward high-resource languages in LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual tokenizer optimized for EU languages
Training on 60% non-English European language data
Custom methodology for EU linguistic diversity support
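The balanced language sampling listed above is not specified in detail on this page. A common technique for balancing high- and low-resource languages in a multilingual training mix is temperature-based sampling, where each language's probability is proportional to its corpus size raised to an exponent below 1. The sketch below is illustrative only; the exponent and corpus sizes are hypothetical, not taken from the paper.

```python
def balanced_sampling_weights(corpus_sizes, alpha=0.3):
    """Return per-language sampling probabilities p_i proportional to n_i ** alpha.

    alpha = 1.0 reproduces the raw corpus proportions; alpha -> 0
    approaches uniform sampling, upweighting low-resource languages.
    """
    scaled = {lang: n ** alpha for lang, n in corpus_sizes.items()}
    total = sum(scaled.values())
    return {lang: s / total for lang, s in scaled.items()}

# Hypothetical token counts (in millions) for three EU languages:
# English, German, and Maltese as a low-resource example.
sizes = {"en": 40_000, "de": 8_000, "mt": 50}
weights = balanced_sampling_weights(sizes)
```

With these hypothetical sizes, Maltese receives a far larger share of the mix than its raw proportion, while English is downweighted, which is the intended effect of balanced sampling.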
Mehdi Ali
Fraunhofer IAIS, LAMARR Institute
Machine Learning, Knowledge Graphs, Relational Learning, NLP, Foundation Models
Michael Fromm
Fraunhofer IAIS
Machine Learning, Large Language Models, Argument Mining
Klaudia Thellmann
Fraunhofer Society (IAIS)
Data Science, Linked Data, Big Data Architectures
Jan Ebert
Forschungszentrum Jülich GmbH
Computer science, artificial intelligence, mathematics, physics
Alexander Arno Weber
Fraunhofer IAIS
Richard Rutmann
Fraunhofer IAIS
Charvi Jain
Fraunhofer IAIS
Max Lübbering
Fraunhofer IAIS
Daniel Steinigen
Fraunhofer IAIS
Johannes Leveling
Fraunhofer IAIS
Katrin Klug
Fraunhofer IAIS
Jasper Schulze Buschhoff
Fraunhofer IAIS
Lena Jurkschat
TU Dresden
Hammam Abdelwahab
Fraunhofer IAIS
Benny Jörg Stein
Fraunhofer IAIS
Karl-Heinz Sylla
Fraunhofer IAIS
Pavel Denisov
Fraunhofer IAIS
Nicolò Brandizzi
Fraunhofer IAIS
Qasid Saleem
Fraunhofer IAIS
Anirban Bhowmick
Fraunhofer IAIS
Lennard Helmer
Fraunhofer IAIS
Chelsea John
FZ Jülich
Pedro Ortiz Suarez
Principal Research Scientist, Common Crawl Foundation
Language modeling, Corpus linguistics, Named Entity Recognition, Computational Linguistics, Machine
Malte Ostendorff
University of Göttingen / German Research Center for Artificial Intelligence
Large language models, Recommender systems, Information retrieval
Alex Jude
Fraunhofer IAIS
Lalith Manjunath
TU Dresden
Samuel Weinbach
Aleph Alpha
Carolin Penke
FZ Jülich
Oleg Filatov
FZ Jülich
Shima Asaadi
Fraunhofer IIS
Fabio Barth
DFKI
Computer Science
R. Sifa
Fraunhofer IAIS
Fabian Küch
Fraunhofer IIS
A. Herten
FZ Jülich
René Jäkel
TU Dresden
Georg Rehm
Principal Researcher and Research Fellow, DFKI GmbH
Natural Language Processing, Artificial Intelligence, Language Technology, Computational Linguistics, Semantic Web
Stefan Kesselheim
Jülich Supercomputing Center, Jülich Research Centre
Machine Learning, Computer Simulation Methods, Statistical Mechanics
Joachim Köhler
Fraunhofer IAIS
Nicolas Flores-Herr
Fraunhofer IAIS