🤖 AI Summary
This work addresses the challenge of developing effective small language models for low-resource languages, which suffer from a scarcity of high-quality training data. The authors propose Kakugo, a fully automated, low-cost pipeline that requires only the name of the target language as input. A large teacher model generates synthetic prompts and translates instruction datasets, and the resulting data is used to train general-purpose small language models. Applied to 54 low-resource languages, the pipeline costs under $50 per language for generation and training combined and consistently improves on the base models across general NLP tasks, including translation, classification, and question answering, which lowers the barrier to AI development for underrepresented languages.
📝 Abstract
We present Kakugo, a novel and cost-effective pipeline designed to train general-purpose Small Language Models (SLMs) for low-resource languages using only the language name as input. By using a large teacher model to generate synthetic prompts and translate instruction datasets, we produced training data and SLMs for 54 low-resource languages. Evaluations across a diverse set of general natural language processing tasks, including translation, classification, and question answering, demonstrate that our pipeline consistently improves performance over base models. With a total generation and training cost of under $50 per language, Kakugo offers an accessible method for communities to develop language-specific AI.