EstLLM: Enhancing Estonian Capabilities in Multilingual LLMs via Continued Pretraining and Post-Training

📅 2026-03-02
📈 Citations: 0 · Influential: 0
🤖 AI Summary
This work addresses the significant performance gap of multilingual large language models on low-resource languages such as Estonian, aiming to close it while maintaining strong capabilities in high-resource languages and on general tasks. Building on Llama 3.1 8B, the authors propose a balanced multilingual data-mixing strategy for continued pretraining, augmented with English replay and enriched with code, mathematical, and instruction-like data. The model is further aligned through supervised fine-tuning, preference optimization, and chat vector fusion. This approach yields substantial improvements across Estonian language understanding, knowledge recall, reasoning, translation, and instruction-following benchmarks while preserving competitive performance on English and general-purpose evaluations, thereby achieving an effective balance between low-resource language enhancement and overall multilingual competence.
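
To make the data-mixing idea concrete, here is a minimal sketch of sampling training documents from a weighted corpus mixture with English replay. The corpus names and weights are hypothetical illustrations, not the paper's actual mixture.

```python
import random

# A minimal sketch of sampling from a balanced CPT mixture with English replay.
# The corpus names and weights below are hypothetical, not the paper's values.
mixture = {
    "estonian_web": 0.40,      # increased Estonian exposure
    "english_replay": 0.30,    # replay to preserve English performance
    "code": 0.15,
    "math": 0.10,
    "instruction_like": 0.05,
}

def sample_source(rng: random.Random) -> str:
    """Pick the corpus for the next training document according to the weights."""
    return rng.choices(list(mixture), weights=list(mixture.values()), k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in mixture}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
print(counts)  # empirical counts approximate the target proportions
```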

📝 Abstract
Large language models (LLMs) are predominantly trained on English-centric data, resulting in uneven performance for smaller languages. We study whether continued pretraining (CPT) can substantially improve Estonian capabilities in a pretrained multilingual LLM while preserving its English and general reasoning performance. Using Llama 3.1 8B as the main base model, we perform CPT on a mixture that increases Estonian exposure while approximating the original training distribution through English replay and the inclusion of code, mathematics, and instruction-like data. We subsequently apply supervised fine-tuning, preference optimization, and chat vector merging to introduce robust instruction-following behavior. Evaluation on a comprehensive suite of Estonian benchmarks shows consistent gains in linguistic competence, knowledge, reasoning, translation quality, and instruction-following compared to the original base model and its instruction-tuned variant, while maintaining competitive performance on English benchmarks. These findings indicate that CPT, with an appropriately balanced data mixture, together with post-training alignment, can substantially improve single-language capabilities in pretrained multilingual LLMs.
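
The chat vector merging mentioned above follows the task-arithmetic idea of adding the difference between an instruction-tuned model and its base to the continued-pretrained model. Below is a minimal sketch, assuming all three checkpoints share the Llama 3.1 8B architecture and parameter names; the file paths and the scaling factor `alpha` are illustrative assumptions, not taken from the paper.

```python
import torch

# Hedged sketch of chat vector merging (task arithmetic). Paths and `alpha`
# are hypothetical; the checkpoints must share architecture and tensor names.
base = torch.load("llama-3.1-8b.pt", map_location="cpu")               # original base model
instruct = torch.load("llama-3.1-8b-instruct.pt", map_location="cpu")  # instruction-tuned variant
cpt = torch.load("estllm-cpt.pt", map_location="cpu")                  # model after Estonian CPT

alpha = 1.0  # hypothetical scaling of the chat vector

merged = {}
for name, w_base in base.items():
    chat_vector = instruct[name] - w_base           # what instruction tuning added
    merged[name] = cpt[name] + alpha * chat_vector  # graft it onto the CPT model

torch.save(merged, "estllm-merged.pt")
```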
Problem

Research questions and friction points this paper is trying to address.

multilingual LLMs
Estonian language
language imbalance
low-resource languages
cross-lingual performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

continued pretraining
multilingual LLMs
Estonian language
post-training alignment
data mixture
Aleksei Dorkin
University of Tartu
Natural Language Processing, Information Retrieval · #tartunlp #unitartucs
Taido Purason
University of Tartu
large language models, natural language processing, machine translation · #unitartucs #tartunlp
Emil Kalbaliyev
Institute of Computer Science, University of Tartu, Tartu, Estonia
Hele-Andra Kuulmets
University of Tartu
natural language processing, large language models, low-resource languages · #unitartucs #tartunlp
Marii Ojastu
Institute of Computer Science, University of Tartu, Tartu, Estonia
Mark Fišel
Institute of Computer Science, University of Tartu, Tartu, Estonia
Tanel Alumäe
Professor of Speech Processing, Tallinn University of Technology
Speech recognition, Natural language processing
Eleri Aedmaa
Institute of the Estonian Language, Tallinn, Estonia
Krister Kruusmaa
Institute of the Estonian Language, Tallinn, Estonia; School of Humanities, Tallinn University, Tallinn, Estonia
Kairit Sirts
University of Tartu
Natural Language ProcessingComputational LinguisticsComputational Psychology#unitartucs