EuroLLM-9B: Technical Report

📅 2025-06-04
đŸ€– AI Summary
European languages, particularly low-resource ones, are underrepresented in existing open large language models (LLMs). To address this, the work introduces EuroLLM-9B, an open multilingual LLM trained from scratch that covers all 24 official EU languages plus 11 additional languages. It also presents EuroFilter, an AI-based multilingual data filter, and EuroBlocks-Synthetic, a synthetic multilingual dataset for post-training. The report covers tokenizer design, architecture, pre-training data collection and filtering, and training procedures. On multilingual benchmarks and machine translation tasks, the model performs competitively and is described as the leading open European-made LLM of its size. The base and instruction-tuned models, the EuroFilter classifier, and the synthetic post-training dataset are released to support open research and adoption.

📝 Abstract
This report presents EuroLLM-9B, a large language model trained from scratch to support the needs of European citizens by covering all 24 official European Union languages and 11 additional languages. EuroLLM addresses the issue of European languages being underrepresented and underserved in existing open large language models. We provide a comprehensive overview of EuroLLM-9B's development, including tokenizer design, architectural specifications, data filtering, and training procedures. We describe the pre-training data collection and filtering pipeline, including the creation of EuroFilter, an AI-based multilingual filter, as well as the design of EuroBlocks-Synthetic, a novel synthetic dataset for post-training that enhances language coverage for European languages. Evaluation results demonstrate EuroLLM-9B's competitive performance on multilingual benchmarks and machine translation tasks, establishing it as the leading open European-made LLM of its size. To support open research and adoption, we release all major components of this work, including the base and instruction-tuned models, the EuroFilter classifier, and the synthetic post-training dataset.
Problem

Research questions and friction points this paper is trying to address.

Addresses underrepresentation of European languages in open LLMs
Develops a multilingual model covering 24 EU and 11 extra languages
Enhances language coverage via synthetic data and AI filtering
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training a multilingual model from scratch
Creating EuroFilter, an AI-based multilingual data filter
Developing EuroBlocks-Synthetic, a synthetic dataset for post-training
P. Martins
Unbabel & Instituto de TelecomunicaçÔes & Instituto Superior Técnico, Universidade de Lisboa
JoĂŁo Alves
Unbabel & Instituto de TelecomunicaçÔes & Instituto Superior Técnico, Universidade de Lisboa
Patrick Fernandes
Carnegie Mellon University & Instituto Superior Técnico
NLP, Machine Learning
Nuno M. Guerreiro
Unbabel & Instituto de TelecomunicaçÔes & Instituto Superior Técnico, Universidade de Lisboa & MICS, CentraleSupélec, Université Paris-Saclay
Ricardo Rei
Sword Health
Healthcare AI, Machine Learning, Natural Language Processing, Large Language Models
Amin Farajian
Unbabel
Natural Language Processing, Machine Translation
Mateusz Klimaszewski
PhD Student, Warsaw University of Technology
natural language processing, machine learning, machine translation
Duarte M. Alves
PhD Student, Instituto Superior Técnico, Lisbon
Natural Language Processing, Machine Learning
José P. Pombal
Unbabel & Instituto de TelecomunicaçÔes & Instituto Superior Técnico, Universidade de Lisboa
Manuel Faysse
CentraleSupélec - Université Paris Saclay
Natural Language Processing, Machine Learning, Privacy
Pierre Colombo
CS of Equall & Ass. Prof @Univ Paris-Saclay (CentraleSupélec)
NLP, Multimodal
François Yvon
Sorbonne Université, CNRS, ISIR
Barry Haddow
University of Edinburgh
NLP, machine translation, spoken language translation, information extraction
J. G. C. D. Souza
Unbabel & Instituto de TelecomunicaçÔes & Instituto Superior Técnico, Universidade de Lisboa
Alexandra Birch
University of Edinburgh
Artificial Intelligence, Computational Linguistics, Machine Learning
André F. T. Martins
Unbabel & Instituto de TelecomunicaçÔes & Instituto Superior Técnico, Universidade de Lisboa