MiniLingua: A Small Open-Source LLM for European Languages

📅 2025-12-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational cost, English-centric bias, and privacy risks of existing large language models (LLMs), this work introduces MiniLingua: an open-source, billion-parameter multilingual LLM trained from scratch and specifically optimized for 13 European languages. Methodologically, MiniLingua employs balanced multilingual pretraining, domain-aware data curation, a custom tokenizer designed for European linguistic diversity, and supervised instruction fine-tuning. Empirically, despite a smaller training budget, MiniLingua outperforms EuroLLM, a similarly trained model with a larger budget, across summarization, text classification, and both open- and closed-book question answering, and its open-ended generation quality remains competitive with state-of-the-art (SOTA) models. All model weights, tokenizer artifacts, and training code are publicly released under an open license, enabling efficient, privacy-preserving local deployment. This contribution advances equitable, transparent, and accessible multilingual AI for European languages.
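The "balanced multilingual pretraining" mentioned above is commonly realized via temperature-based language sampling, which up-weights low-resource languages relative to their raw corpus share. A minimal sketch of that idea follows; the function name, the temperature value, and the per-language token counts are illustrative assumptions, not figures from the paper:

```python
# Hedged sketch of temperature-based language sampling, one common way
# to balance multilingual pretraining data. All numbers below are
# illustrative assumptions, not values reported for MiniLingua.

def balanced_sampling_probs(corpus_sizes, temperature=0.3):
    """Return per-language sampling probabilities p_i proportional to n_i ** T.

    T = 1.0 reproduces the raw corpus proportions; T -> 0 approaches a
    uniform distribution, boosting low-resource languages.
    """
    weighted = {lang: n ** temperature for lang, n in corpus_sizes.items()}
    total = sum(weighted.values())
    return {lang: w / total for lang, w in weighted.items()}

# Illustrative token counts (in billions) for three of the 13 languages.
sizes = {"en": 500, "de": 120, "mt": 2}
probs = balanced_sampling_probs(sizes)
```

With these made-up sizes, a low-resource language such as Maltese receives a far larger sampling share than its raw ~0.3% of tokens, which is the effect a balanced pretraining mix is after.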

📝 Abstract
Large language models are powerful but often limited by high computational cost, privacy concerns, and English-centric training. Recent progress demonstrates that small, efficient models with around one billion parameters can deliver strong results and enable on-device use. This paper introduces MiniLingua, a multilingual open-source LLM of one billion parameters trained from scratch for 13 European languages, designed to balance coverage and instruction-following capabilities. Based on evaluation results, the instruction-tuned version of MiniLingua outperforms EuroLLM, a model with a similar training approach but a larger training budget, on summarization, classification and both open- and closed-book question answering. Moreover, it remains competitive with more advanced state-of-the-art models on open-ended generation tasks. We release model weights, tokenizer and source code used for data processing and model training.
Problem

Research questions and friction points this paper is trying to address.

Develops a small, open-source multilingual LLM for European languages
Addresses high computational cost, privacy, and English-centric limitations
Balances broad language coverage with strong instruction-following capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Small open-source LLM for 13 European languages
One billion parameters trained from scratch for efficiency
Instruction-tuned version outperforms the larger-budget EuroLLM on summarization, classification, and question answering