Small Languages, Big Models: A Study of Continual Training on Languages of Norway

📅 2024-12-09
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This paper addresses data scarcity and the trade-off between performance and efficiency when training large language models (LLMs) for low-resource languages, particularly the extremely low-resource North Sámi, by proposing a three-stage progressive continual training paradigm. The approach builds on the Mistral architecture and integrates multi-stage continual pretraining, supervised fine-tuning, and multilingual mixed curriculum learning, augmented by data-weighted sampling and language-adaptive LoRA. The paper introduces NorMistral-11B, the first open-source 11.4-billion-parameter trilingual generative model supporting Norwegian Bokmål, Nynorsk, and North Sámi. Experimental results show that NorMistral-11B significantly outperforms existing open-source models on multiple Norwegian and North Sámi benchmarks while achieving a 37% improvement in inference speed. This work establishes a reusable framework and a high-quality open-source foundation model for LLM development across languages with heterogeneous resource levels.

📝 Abstract
Training large language models requires vast amounts of data, posing a challenge for less widely spoken languages like Norwegian and even more so for truly low-resource languages like Northern Sámi. To address this issue, we present a novel three-stage continual training approach that substantially improves the downstream performance together with the inference efficiency for the target languages. Based on our findings, we train, evaluate, and openly release a new generative language model for Norwegian Bokmål, Nynorsk, and Northern Sámi with 11.4 billion parameters: NorMistral-11B.
Problem

Research questions and friction points this paper is trying to address.

Low-resource Languages
Large-scale Language Models
Data Scarcity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Three-step Training Method
Large Language Model
Norwegian and North Sámi Languages