Building a Strong Instruction Language Model for a Less-Resourced Language

📅 2026-03-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the significant performance degradation of open-source large language models on less-resourced languages such as Slovene by proposing a systematic adaptation approach. Building on the Gemma 3 architecture, the authors develop GaMS3-12B, a 12-billion-parameter model trained with three-stage continual pre-training on 140 billion multilingual tokens, followed by two-stage supervised fine-tuning on over 200,000 bilingual examples incorporating Slovene-specific optimization strategies. The resulting model consistently outperforms the base Gemma 3 of comparable scale on Slovene language understanding, generation, and English-to-Slovene translation tasks. Notably, GaMS3-12B achieves a win rate of over 60% in the Slovene LLM Arena, performing comparably to the much larger commercial GPT-4o and substantially narrowing the gap between models for less-resourced languages and state-of-the-art commercial systems.
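
The adaptation pipeline summarized above (continual pre-training, then supervised fine-tuning) follows a standard causal-language-modeling recipe. Below is a minimal sketch of a single continual pre-training stage using the Hugging Face transformers and datasets libraries; the checkpoint id, corpus paths, language mixing ratio, sequence length, and hyperparameters are illustrative assumptions, not the authors' actual configuration.

```python
from datasets import load_dataset, interleave_datasets
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Assumed checkpoint id for the base (pre-trained, non-instruct) Gemma 3 model.
MODEL_ID = "google/gemma-3-12b-pt"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="bfloat16")

# Hypothetical corpora: Slovene mixed with English (and, in the paper,
# Bosnian, Croatian, and Serbian) to retain general abilities while adapting.
slovene = load_dataset("text", data_files="corpora/slovene/*.txt", split="train")
english = load_dataset("text", data_files="corpora/english/*.txt", split="train")
mixture = interleave_datasets([slovene, english], probabilities=[0.7, 0.3], seed=42)

def tokenize(batch):
    # Plain causal-LM tokenization; the sequence length is an assumption.
    return tokenizer(batch["text"], truncation=True, max_length=4096)

tokenized = mixture.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="gams3-12b-cpt-stage1",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=64,
        learning_rate=1e-5,
        num_train_epochs=1,
        bf16=True,
        logging_steps=50,
    ),
    train_dataset=tokenized,
    # mlm=False yields standard next-token (causal) language-modeling labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Each subsequent stage would repeat this loop from the previous stage's checkpoint with a different data mixture, typically shifting weight toward Slovene in later stages.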

📝 Abstract
Large language models (LLMs) have become an essential tool for natural language processing and artificial intelligence in general. Current open-source models are primarily trained on English texts, resulting in poorer performance on less-resourced languages and cultures. We present a set of methodological approaches necessary for the successful adaptation of an LLM to a less-resourced language, and demonstrate them on the Slovene language. We present GaMS3-12B, a generative model for Slovene with 12 billion parameters, and demonstrate that it is the best-performing open-source model for Slovene within its parameter range. We adapted the model to Slovene using three-stage continual pre-training of the Gemma 3 model, followed by two-stage supervised fine-tuning (SFT). We trained the model on a combination of 140B Slovene, English, Bosnian, Serbian, and Croatian pre-training tokens and more than 200,000 English and Slovene SFT examples. We evaluate GaMS3-12B on the Slovenian-LLM-Eval datasets, English-to-Slovene translation, and the Slovene LLM Arena. We show that the model outperforms the 12B Gemma 3 in all three scenarios and performs comparably to the much larger commercial GPT-4o in the Slovene LLM Arena, achieving a win rate of over 60%.
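
The two-stage SFT described in the abstract trains on chat-formatted instruction-response pairs. The following is a minimal sketch of one SFT stage using the trl library; the data file, checkpoint names, and hyperparameters are hypothetical placeholders, not the authors' exact recipe.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Hypothetical JSONL file of chat-formatted examples, e.g.
# {"messages": [{"role": "user", ...}, {"role": "assistant", ...}]},
# mixing Slovene and English instructions.
dataset = load_dataset("json", data_files="sft/slovene_english_sft.jsonl", split="train")

trainer = SFTTrainer(
    model="gams3-12b-cpt-stage3",  # checkpoint from the last pre-training stage
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="gams3-12b-sft-stage1",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=16,
        learning_rate=2e-5,
        num_train_epochs=2,
        bf16=True,
    ),
)
trainer.train()
```

In a two-stage setup such as the paper's, a second SFT pass would repeat this with a different example mixture; the single pass shown here is illustrative only.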
Problem

Research questions and friction points this paper is trying to address.

low-resource language
large language model
language adaptation
Slovene
instruction tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

low-resource language adaptation
continual pre-training
supervised fine-tuning
multilingual training
instruction tuning
Domen Vreš
University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, Slovenia
Tjaša Arčon
University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, Slovenia
Timotej Petrič
University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, Slovenia
Dario Vajda
University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, Slovenia
Marko Robnik-Šikonja
Professor of Computer Science, University of Ljubljana, Head of ML & LT Lab
Machine Learning, Artificial Intelligence, Natural Language Processing, Explainable AI
Iztok Lebar Bajec
University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, Slovenia