Instruction Tuning on Public Government and Cultural Data for Low-Resource Language: a Case Study in Kazakh

📅 2025-02-19

🤖 AI Summary

Problem: Low-resource languages such as Kazakh lack high-quality, domain-specific instruction-tuning datasets, particularly in government and cultural domains, which hinders the development of reliable large language models (LLMs) for public-sector applications. Method: We introduce a Kazakh-language instruction dataset of 10,600 samples covering legal frameworks, administrative procedures, and national cultural knowledge. Samples are generated with GPT-4o, selected after comparing open-weight and closed-weight models, and every sample is then manually verified for accuracy and domain fidelity. Contribution/Results: The dataset, presented as the largest publicly available Kazakh instruction dataset focused on governance and culture, is open-sourced. Supervised instruction fine-tuning of Qwen, Falcon, and Gemma on it yields consistent gains on both multiple-choice and open-ended generation tasks, supporting the combination of LLM-assisted data synthesis with full human verification for low-resource language governance applications.

📝 Abstract

Instruction tuning in low-resource languages remains underexplored due to limited text data, particularly in government and cultural domains. To address this, we introduce and open-source a large-scale (10,600 samples) instruction fine-tuning (IFT) dataset covering key institutional and cultural knowledge relevant to Kazakhstan. Our dataset enhances LLMs' understanding of procedural, legal, and structural governance topics. We employ LLM-assisted data generation, comparing open-weight and closed-weight models for dataset construction, and select GPT-4o as the backbone. Each entry in our dataset undergoes full manual verification to ensure high quality. We also show that fine-tuning Qwen, Falcon, and Gemma on our dataset leads to consistent performance improvements in both multiple-choice and generative tasks, demonstrating the potential of LLM-assisted instruction tuning for low-resource languages.
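
The pipeline described in the abstract (LLM-generated samples, full manual verification, then supervised fine-tuning) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `Sample` fields, the domain labels, the prompt template, and the English placeholder strings are hypothetical and are not taken from the released dataset, whose entries are in Kazakh.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One instruction-tuning record (hypothetical schema, not the paper's)."""
    instruction: str
    response: str
    domain: str      # hypothetical labels such as "government" or "culture"
    verified: bool   # set by a human reviewer, never by the generator model


def keep_verified(samples):
    """Return only the samples that a human reviewer has approved."""
    return [s for s in samples if s.verified]


def to_training_text(sample):
    """Render one sample as a single training string (hypothetical template)."""
    return (
        f"### Instruction:\n{sample.instruction}\n\n"
        f"### Response:\n{sample.response}"
    )


# Placeholder records; real entries would be LLM drafts checked by annotators.
samples = [
    Sample("Explain how to register a business.", "Verified answer text.",
           "government", True),
    Sample("Describe the Nauryz holiday.", "Unchecked draft text.",
           "culture", False),
]

# Only human-verified samples reach the fine-tuning set.
train_texts = [to_training_text(s) for s in keep_verified(samples)]
```

Keeping the `verified` flag separate from generation makes the human check an explicit gate: an unreviewed draft cannot enter the training set by default.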

Problem

Research questions and friction points this paper is trying to address.

Instruction tuning for low-resource languages

Enhancing LLMs with Kazakh government data

LLM-assisted data generation for cultural domains

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-assisted large-scale dataset generation

Manual verification ensures high-quality data

Fine-tuning Qwen, Falcon, and Gemma yields consistent gains on multiple-choice and generative tasks
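
For the multiple-choice side of the evaluation, accuracy is the usual metric. The sketch below assumes answers are single option letters compared after trimming whitespace and ignoring case. The function name and the toy inputs are hypothetical; they are not the paper's evaluation code or its results.

```python
def multiple_choice_accuracy(predictions, gold):
    """Fraction of predicted option letters that match the gold letters."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(
        p.strip().upper() == g.strip().upper()
        for p, g in zip(predictions, gold)
    )
    return correct / len(gold)


# Toy inputs for illustration only; these are not results from the paper.
toy_predictions = ["A", "b", " B", "D"]
toy_gold = ["A", "B", "B", "C"]
toy_accuracy = multiple_choice_accuracy(toy_predictions, toy_gold)
```

Running the same function on a model's outputs before and after fine-tuning gives a direct comparison on the multiple-choice task. Open-ended generation needs a different metric or human judgment.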

🔎 Similar Papers

Self-Alignment: Improving Alignment of Cultural Values in LLMs via In-Context Learning