🤖 AI Summary
To address data scarcity and the trade-off between performance and efficiency in large language model (LLM) training for low-resource languages, particularly the extremely low-resource Northern Sámi, this paper proposes a three-stage progressive continual training paradigm. Building on the Mistral architecture, the approach combines multi-stage continual pretraining, supervised fine-tuning, and multilingual mixed curriculum learning, augmented by data-weighted sampling and language-adaptive LoRA adaptation. The authors introduce NorMistral-11B, the first open-source 11.4-billion-parameter trilingual generative model supporting Norwegian Bokmål, Nynorsk, and Northern Sámi. Experimental results show that NorMistral-11B significantly outperforms existing open-source models on multiple Norwegian and Northern Sámi benchmarks while achieving a 37% improvement in inference speed. This work establishes a reusable framework and a high-quality open-source foundation model for LLM development across languages with heterogeneous resource levels.
📝 Abstract
Training large language models requires vast amounts of data, posing a challenge for less widely spoken languages like Norwegian and even more so for truly low-resource languages like Northern Sámi. To address this issue, we present a novel three-stage continual training approach that substantially improves the downstream performance together with the inference efficiency for the target languages. Based on our findings, we train, evaluate, and openly release a new generative language model for Norwegian Bokmål, Nynorsk, and Northern Sámi with 11.4 billion parameters: NorMistral-11B.