Instructing Large Language Models for Low-Resource Languages: A Systematic Study for Basque

📅 2025-06-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Instruction-tuning data scarcity severely hinders the development of capable instruction-following models for low-resource languages like Basque. Method: The paper explores a fully automated, annotation-free alternative to conventional instruction-adaptation pipelines, assuming only target-language corpora, open-weight multilingual backbone LLMs, and synthetic instructions sampled from the instructed backbone itself. Using up to Llama 3.1 Instruct 70B as the backbone, it combines a 1.2B-word monolingual Basque corpus with these self-generated synthetic instructions and evaluates the resulting models on standard benchmarks and on human preferences collected from 1,680 participants. Contribution/Results: The experiments show that target-language corpora are essential, that synthetic instructions yield robust models, that starting from an instruction-tuned backbone outperforms starting from a non-instructed base model, and that results improve with scale. The strongest model comes close to much larger frontier models on Basque NLP tasks while using no Basque data beyond the monolingual corpus. All code, fine-tuned models, synthetic instruction datasets, and human preference data are publicly released.
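To make the synthetic-instruction step concrete, here is a minimal sketch, assuming the Hugging Face transformers library, of how an instructed backbone can turn a monolingual Basque document into an instruction-response pair. The prompt wording and the helper function are illustrative assumptions, not the paper's actual pipeline.

```python
# Hypothetical sketch: prompt the instructed backbone to derive a synthetic
# instruction-response pair from a monolingual Basque document.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-70B-Instruct",  # instructed backbone used in the paper
    device_map="auto",
)

def synthesize_instruction(document: str) -> str:
    """Ask the backbone to invent an instruction grounded in the document (assumed prompt)."""
    messages = [{
        "role": "user",
        "content": (
            "Read the following Basque text and write, in Basque, one instruction "
            "a user might plausibly ask, followed by a faithful answer.\n\n" + document
        ),
    }]
    out = generator(messages, max_new_tokens=512, do_sample=True, temperature=0.8)
    # The pipeline returns the full chat; the last turn is the model's reply.
    return out[0]["generated_text"][-1]["content"]
```

Pairs produced this way would then be filtered and formatted into a fine-tuning dataset; the actual synthetic instruction data used in the paper is publicly released.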

📝 Abstract
Instructing language models to follow user intent requires large instruction datasets, which are only available for a limited set of languages. In this paper, we explore alternatives to conventional instruction adaptation pipelines in low-resource scenarios. We assume a realistic scenario for low-resource languages, where only the following are available: corpora in the target language, existing open-weight multilingual base and instructed backbone LLMs, and synthetically generated instructions sampled from the instructed backbone. We present a comprehensive set of experiments for Basque that systematically study different combinations of these components, evaluated on benchmarks and on human preferences from 1,680 participants. Our conclusions show that target-language corpora are essential, that synthetic instructions yield robust models, and, most importantly, that using an instruction-tuned model as the backbone outperforms using a non-instructed base model, with results improving further when scaling up. Using Llama 3.1 Instruct 70B as the backbone, our model comes close to frontier models of much larger size for Basque, without using any Basque data apart from the 1.2B-word corpus. We release code, models, instruction datasets, and human preferences to support full reproducibility in future research on low-resource language adaptation.
Problem

Research questions and friction points this paper is trying to address.

Exploring instruction adaptation for low-resource languages like Basque
Evaluating synthetic instructions and multilingual models for Basque
Improving performance with instruction-tuned backbones and scaling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizes synthetic instructions from multilingual LLMs
Leverages instruction-tuned backbone model for better performance
Combines target-language corpora with synthetic data (a data-mixing sketch follows below)
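The combination of target-language corpora with synthetic instructions can be pictured as a simple data-mixing step before fine-tuning. Below is a minimal sketch, assuming the Hugging Face datasets library; the file names and the 80/20 mixing ratio are illustrative assumptions, since the paper systematically compares such combinations rather than fixing a single recipe.

```python
# Hypothetical sketch: interleave a raw target-language corpus with
# synthetic instruction-response pairs into one training mixture.
from datasets import load_dataset, interleave_datasets

basque_corpus = load_dataset("text", data_files="basque_corpus.txt", split="train")
synthetic_sft = load_dataset("json", data_files="synthetic_instructions.jsonl", split="train")

# Normalize both sources to a single "text" column so they can be interleaved.
synthetic_sft = synthetic_sft.map(
    lambda ex: {"text": ex["instruction"] + "\n" + ex["response"]},
    remove_columns=synthetic_sft.column_names,
)

# Sample roughly 80% raw corpus / 20% synthetic instructions (assumed ratio).
train_mix = interleave_datasets(
    [basque_corpus, synthetic_sft],
    probabilities=[0.8, 0.2],
    seed=42,
)
```

In the paper, such mixtures are evaluated on benchmarks and human preferences across backbone choices and scales, rather than committing to one fixed proportion.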
Oscar Sainz
University of the Basque Country (UPV/EHU)
Computer Science · Artificial Intelligence · Natural Language Processing · Information Extraction
Naiara Pérez
HiTZ Center - Ixa, University of the Basque Country UPV/EHU
Julen Etxaniz
PhD Student in NLP, HiTZ, University of the Basque Country
Multilinguality · NLP · DL · ML · AI
Joseba Fernandez de Landa
HiTZ Center - Ixa, University of the Basque Country UPV/EHU
Itziar Aldabe
HiTZ Center - Ixa, University of the Basque Country UPV/EHU
Iker García-Ferrero
HiTZ Center - Ixa, University of the Basque Country UPV/EHU
Aimar Zabala
HiTZ Center - Ixa, University of the Basque Country UPV/EHU
Ekhi Azurmendi
HiTZ Center - Ixa, University of the Basque Country UPV/EHU
Germán Rigau
HiTZ Center - Ixa, University of the Basque Country UPV/EHU
Eneko Agirre
HiTZ Center - Ixa, University of the Basque Country UPV/EHU
Mikel Artetxe
Reka AI
Multilinguality · NLP · Machine Learning · AI
Aitor Soroa
HiTZ Center - Ixa, University of the Basque Country UPV/EHU