🤖 AI Summary
To address data scarcity and the trade-off between performance and efficiency in large language model (LLM) training for low-resource languages, particularly the extremely low-resource Northern Sámi, this paper proposes a three-stage progressive continual training paradigm. Building on the Mistral architecture, the approach combines multi-stage continual pretraining, supervised fine-tuning, and multilingual mixed curriculum learning, augmented by data-weighted sampling and language-adaptive LoRA adaptation. The authors introduce NorMistral-11B, the first open-source 11.4-billion-parameter trilingual generative model supporting Norwegian Bokmål, Nynorsk, and Northern Sámi. Experimental results show that NorMistral-11B significantly outperforms existing open-source models on multiple Norwegian and Northern Sámi benchmarks while achieving a 37% improvement in inference speed. This work establishes a reusable framework and a high-quality open-source foundation model for LLM development across languages with heterogeneous resource levels.
📝 Abstract
Training large language models requires vast amounts of data, posing a challenge for less widely spoken languages like Norwegian and even more so for truly low-resource languages like Northern Sámi. To address this issue, we present a novel three-stage continual training approach that substantially improves the downstream performance together with the inference efficiency for the target languages. Based on our findings, we train, evaluate, and openly release a new generative language model for Norwegian Bokmål, Nynorsk, and Northern Sámi with 11.4 billion parameters: NorMistral-11B.