🤖 AI Summary
This work addresses the limited support for European languages in existing open-source large language models by training from scratch a 22-billion-parameter multilingual model covering all 24 official languages of the European Union along with 11 additional languages. The project introduces the first large-scale open-source model to comprehensively encompass all EU official languages, combining a standard Transformer architecture with a custom multilingual tokenizer, extensive cleaning of web-sourced text, and efficient pretraining followed by instruction tuning. The authors release the full pretraining and instruction-tuning datasets alongside complete implementation code. The resulting model achieves state-of-the-art performance among comparably sized models on multilingual reasoning, instruction-following, and translation benchmarks.
📝 Abstract
This report presents EuroLLM-22B, a large language model trained from scratch to support the needs of European citizens by covering all 24 official European Union languages and 11 additional languages. EuroLLM addresses the underrepresentation of European languages in existing open large language models. We provide a comprehensive overview of EuroLLM-22B's development, including tokenizer design, architectural specifications, data filtering, and training procedures. Across a broad set of multilingual benchmarks, EuroLLM-22B demonstrates strong performance in reasoning, instruction following, and translation, achieving results competitive with models of comparable size. To support future research, we release our base and instruction-tuned models, our multilingual web pretraining data and updated EuroBlocks instruction datasets, as well as our pretraining and evaluation codebases.