EuroLLM-22B: Technical Report

📅 2026-02-05

🤖 AI Summary
This work addresses the limited support for European languages in existing open-source large language models by training, from scratch, a 22-billion-parameter multilingual model covering all 24 official languages of the European Union along with 11 additional languages. The project introduces the first large-scale open-source model to comprehensively cover all EU official languages, built on the standard Transformer architecture and leveraging a custom multilingual tokenizer, extensive cleaning of web-sourced text, and efficient pretraining followed by instruction tuning. The authors release the full pretraining and instruction-tuning datasets alongside the complete implementation code. The resulting model achieves state-of-the-art performance among comparably sized models on multilingual reasoning, instruction-following, and translation benchmarks.
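The "extensive cleaning of web-sourced text" mentioned above can be sketched with simple heuristic filters. The rules and thresholds below (minimum length, alphabetic-character ratio, exact deduplication by hash) are illustrative assumptions, not the pipeline EuroLLM actually used:

```python
# Sketch of heuristic web-text filtering; thresholds are assumptions,
# not the EuroLLM pipeline's actual values.
import hashlib


def keep(line: str, min_chars: int = 30, min_alpha_ratio: float = 0.6) -> bool:
    """Drop lines that are too short or mostly non-alphabetic (menus, markup, noise)."""
    line = line.strip()
    if len(line) < min_chars:
        return False
    alpha = sum(ch.isalpha() for ch in line)
    return alpha / len(line) >= min_alpha_ratio


def dedup(lines):
    """Exact deduplication via content hashing, keeping first occurrences."""
    seen, out = set(), []
    for line in lines:
        h = hashlib.sha256(line.strip().lower().encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(line)
    return out


corpus = [
    "Click here | Home | Login",                            # navigation boilerplate
    "A Europa tem 24 linguas oficiais na Uniao Europeia.",
    "A Europa tem 24 linguas oficiais na Uniao Europeia.",  # exact duplicate
    "ok",                                                   # too short
]
cleaned = [line for line in dedup(corpus) if keep(line)]
print(cleaned)  # only the Portuguese sentence survives
```

Real pipelines layer many more signals on top of this (language identification, perplexity filters, fuzzy deduplication), but the keep/drop structure is the same.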

📝 Abstract
This report presents EuroLLM-22B, a large language model trained from scratch to support the needs of European citizens by covering all 24 official European Union languages and 11 additional languages. EuroLLM addresses the issue of European languages being underrepresented and underserved in existing open large language models. We provide a comprehensive overview of EuroLLM-22B's development, including tokenizer design, architectural specifications, data filtering, and training procedures. Across a broad set of multilingual benchmarks, EuroLLM-22B demonstrates strong performance in reasoning, instruction following, and translation, achieving results competitive with models of comparable size. To support future research, we release our base and instruction-tuned models, our multilingual web pretraining data and updated EuroBlocks instruction datasets, as well as our pre-training and evaluation codebases.
Problem

Research questions and friction points this paper is trying to address.

language model
European languages
underrepresentation
multilingual AI
open LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

multilingual LLM
tokenizer design
data filtering
instruction tuning
language coverage
Miguel Moura Ramos
Instituto Superior Técnico & Universidade de Lisboa (Lisbon ELLIS Unit) & Instituto de Telecomunicações
Duarte M. Alves
PhD Student, Instituto Superior Técnico, Lisbon
Natural Language Processing, Machine Learning
Hippolyte Gisserot-Boukhlef
PhD Candidate, CentraleSupélec, Université Paris-Saclay
Artificial Intelligence, LLMs
João Alves
Acolad
Pedro Henrique Martins
Sword Health
Healthcare AI, Large Language Models, Natural Language Processing, Machine Learning
Patrick Fernandes
Carnegie Mellon University & Instituto Superior Técnico
NLP, Machine Learning
José Pombal
Sword Health
Nuno M. Guerreiro
Sword Health
Ricardo Rei
Sword Health
Healthcare AI, Machine Learning, Natural Language Processing, Large Language Models
Nicolas Boizard
PhD Student @ University of Paris-Saclay (MICS - CentraleSupelec) x Diabolocom
NLP, Artificial Intelligence
Amin Farajian
Unbabel
Natural Language Processing, Machine Translation
Mateusz Klimaszewski
PhD Student, Warsaw University of Technology
Natural Language Processing, Machine Learning, Machine Translation
José G. C. de Souza
Principal Research Scientist, Outsystems
Natural Language Processing, Machine Learning, Machine Translation, Quality Estimation for NLP
Barry Haddow
University of Edinburgh
NLP, Machine Translation, Spoken Language Translation, Information Extraction
François Yvon
ISIR / CNRS et Sorbonne Université
Natural Language Processing, Speech Processing, Computational Linguistics, Machine Translation
Pierre Colombo
CS of Equall & Assoc. Prof @ Univ Paris-Saclay (CentraleSupelec)
NLP, Multimodal
Alexandra Birch
University of Edinburgh
Artificial Intelligence, Computational Linguistics, Machine Learning
André F. T. Martins
Instituto Superior Técnico & Universidade de Lisboa (Lisbon ELLIS Unit) & Instituto de Telecomunicações & TransPerfect