Kakugo: Distillation of Low-Resource Languages into Small Language Models

📅 2026-01-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the difficulty of building effective small language models for low-resource languages, where high-quality training data is scarce. The authors propose a fully automated, low-cost pipeline that takes only the name of the target language as input. A large teacher model generates synthetic prompts and translates instruction datasets, and the resulting data is used to distill general-purpose small language models. Applied to 54 low-resource languages, the pipeline keeps the total generation and training cost under $50 per language and consistently improves over the base models on general NLP tasks, including translation, classification, and question answering, lowering the barrier to AI development for underrepresented languages.

📝 Abstract

We present Kakugo, a novel and cost-effective pipeline designed to train general-purpose Small Language Models (SLMs) for low-resource languages using only the language name as input. By using a large teacher model to generate synthetic prompts and translate instruction datasets, we produced training data and SLMs for 54 low-resource languages. Evaluations across a diverse set of general natural language processing tasks, including translation, classification, and question answering, demonstrate that our pipeline consistently improves performance over base models. With a total generation and training cost of under $50 per language, Kakugo offers an accessible method for communities to develop language-specific AI.
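
The data-generation step described in the abstract can be pictured with a minimal sketch. This is not the authors' code: the `Teacher` callable, the prompt wording, the `topics` list, and the function name `build_training_data` are all hypothetical, chosen only to illustrate the two data sources the abstract names (teacher-generated synthetic prompts and teacher-translated instruction data).

```python
from typing import Callable, Dict, List

# Hypothetical teacher interface: takes a prompt string, returns generated text.
# In practice this would wrap a call to a large language model.
Teacher = Callable[[str], str]


def build_training_data(
    language: str,
    teacher: Teacher,
    topics: List[str],
    instruction_data: List[Dict[str, str]],
) -> List[Dict[str, str]]:
    """Collect distillation examples for one language from two sources."""
    examples: List[Dict[str, str]] = []

    # Source 1: the teacher writes a prompt in the target language,
    # then answers it in the same language.
    for topic in topics:
        prompt = teacher(f"Write one user request in {language} about: {topic}")
        response = teacher(f"Answer in {language}: {prompt}")
        examples.append(
            {"prompt": prompt, "response": response, "source": "generated"}
        )

    # Source 2: the teacher translates an existing instruction dataset.
    for item in instruction_data:
        prompt = teacher(f"Translate into {language}: {item['prompt']}")
        response = teacher(f"Translate into {language}: {item['response']}")
        examples.append(
            {"prompt": prompt, "response": response, "source": "translated"}
        )

    return examples


def echo_teacher(prompt: str) -> str:
    """Stand-in teacher for testing; it only echoes the request it received."""
    return f"<teacher output for: {prompt}>"
```

Calling `build_training_data("Yoruba", echo_teacher, ["weather"], [{"prompt": "Hi", "response": "Hello"}])` returns two records, one per source. The language name here is an arbitrary placeholder and is not claimed to be one of the 54 languages covered. The resulting records would then feed a supervised fine-tuning run on the small model, which this sketch leaves out.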

Problem

Research questions and friction points this paper is trying to address.

low-resource languages

Small Language Models

language-specific AI

natural language processing

cost-effective training

Innovation

Methods, ideas, or system contributions that make the work stand out.

distillation

low-resource languages

small language models