🤖 AI Summary
This work targets the performance gap that multilingual large language models exhibit on low-resource languages such as Estonian, while aiming to preserve strong capabilities in high-resource languages and on general tasks. Building on Llama 3.1 8B, the authors propose a balanced multilingual data mixture for continued pretraining that raises Estonian exposure, adds English replay, and is enriched with code, mathematical, and instruction-like data. The model is then aligned through supervised fine-tuning, preference optimization, and chat vector merging. This approach yields substantial improvements on Estonian benchmarks for language understanding, knowledge recall, reasoning, translation, and instruction following, while preserving competitive performance on English and general-purpose evaluations, striking an effective balance between low-resource language enhancement and overall multilingual competence.
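As a minimal sketch of what a replay-balanced CPT mixture could look like in practice, the snippet below samples training documents from weighted sources. The dataset names and proportions are illustrative assumptions for this sketch, not the mix reported in the paper.

```python
import random

# Hypothetical sampling weights for a CPT mixture that raises Estonian
# exposure while replaying English and keeping code/math/instruction data.
# These proportions are assumptions for illustration, not the paper's mix.
MIXTURE_WEIGHTS = {
    "estonian_web": 0.40,      # target-language exposure
    "english_replay": 0.35,    # replay to limit forgetting of English
    "code": 0.10,
    "math": 0.05,
    "instruction_like": 0.10,
}

def sample_source(rng: random.Random) -> str:
    """Pick the source of the next training document according to the mixture weights."""
    sources, weights = zip(*MIXTURE_WEIGHTS.items())
    return rng.choices(sources, weights=weights, k=1)[0]

if __name__ == "__main__":
    rng = random.Random(0)
    counts = {name: 0 for name in MIXTURE_WEIGHTS}
    for _ in range(10_000):
        counts[sample_source(rng)] += 1
    print(counts)  # empirical counts roughly track the target proportions
```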
📝 Abstract
Large language models (LLMs) are predominantly trained on English-centric data, resulting in uneven performance for smaller languages. We study whether continued pretraining (CPT) can substantially improve Estonian capabilities in a pretrained multilingual LLM while preserving its English and general reasoning performance. Using Llama 3.1 8B as the main base model, we perform CPT on a mixture that increases Estonian exposure while approximating the original training distribution through English replay and the inclusion of code, mathematics, and instruction-like data. We subsequently apply supervised fine-tuning, preference optimization, and chat vector merging to introduce robust instruction-following behavior. Evaluation on a comprehensive suite of Estonian benchmarks shows consistent gains in linguistic competence, knowledge, reasoning, translation quality, and instruction-following compared to the original base model and its instruction-tuned variant, while maintaining competitive performance on English benchmarks. These findings indicate that CPT, with an appropriately balanced data mixture, together with post-training alignment, can substantially improve single-language capabilities in pretrained multilingual LLMs.
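Chat vector merging adds the weight delta between an instruction-tuned model and its base to another model with the same architecture. Below is a minimal sketch using PyTorch and Transformers; the checkpoint paths and the scaling factor `ALPHA` are assumptions for illustration, not the paper's exact recipe.

```python
import torch
from transformers import AutoModelForCausalLM

# Hypothetical checkpoints: the original base, its instruction-tuned variant,
# and the Estonian CPT model derived from the base. Paths are placeholders.
BASE = "meta-llama/Llama-3.1-8B"
INSTRUCT = "meta-llama/Llama-3.1-8B-Instruct"
CPT = "path/to/estonian-cpt-checkpoint"  # assumed local CPT checkpoint
ALPHA = 1.0  # scaling of the chat vector; an assumed value

base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)
instruct = AutoModelForCausalLM.from_pretrained(INSTRUCT, torch_dtype=torch.bfloat16)
merged = AutoModelForCausalLM.from_pretrained(CPT, torch_dtype=torch.bfloat16)

with torch.no_grad():
    base_sd = base.state_dict()
    instruct_sd = instruct.state_dict()
    for name, param in merged.state_dict().items():
        # Chat vector = instruct - base; add it (scaled) to the CPT weights.
        if name in base_sd and name in instruct_sd:
            param += ALPHA * (instruct_sd[name] - base_sd[name])

merged.save_pretrained("llama-3.1-8b-estonian-chat-vector")
```

In-place updates through `state_dict()` work here because its tensors share storage with the model parameters; in a memory-constrained setting one would instead stream the deltas shard by shard rather than holding three 8B models at once.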