🤖 AI Summary
To address the high computational cost, English-centric bias, and privacy risks of existing large language models (LLMs), this work introduces MiniLingua: an open-source multilingual LLM with one billion parameters, trained from scratch for 13 European languages and designed to balance language coverage with instruction-following capability. Empirically, the instruction-tuned MiniLingua outperforms EuroLLM, a model with a similar training approach but a larger training budget, on summarization, text classification, and both open- and closed-book question answering, and it remains competitive with more advanced state-of-the-art (SOTA) models on open-ended generation tasks. The model weights, tokenizer, and the source code used for data processing and training are publicly released, enabling efficient, privacy-preserving local deployment. This contribution advances equitable, transparent, and accessible multilingual AI for European languages.
📝 Abstract
Large language models are powerful but often limited by high computational cost, privacy concerns, and English-centric training. Recent progress demonstrates that small, efficient models with around one billion parameters can deliver strong results and enable on-device use. This paper introduces MiniLingua, a multilingual open-source LLM of one billion parameters trained from scratch for 13 European languages, designed to balance language coverage and instruction-following capabilities. Based on evaluation results, the instruction-tuned version of MiniLingua outperforms EuroLLM, a model with a similar training approach but a larger training budget, on summarization, classification, and both open- and closed-book question answering. Moreover, it remains competitive with more advanced state-of-the-art models on open-ended generation tasks. We release the model weights, tokenizer, and source code used for data processing and model training.