Multilingual TinyStories: A Synthetic Combinatorial Corpus of Indic Children's Stories for Training Small Language Models

📅 2026-03-15

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

This work addresses the scarcity of high-quality, domain-appropriate training corpora for small language models in low-resource Indian languages. The authors propose a hybrid data construction pipeline that combines native generation with cross-lingual expansion: the Sarvam-M model, driven by a combinatorial prompt engineering framework, generates native-language stories, and the Google Translate API expands the collection across languages. Programmatic filtering is then applied to enforce quality. The resulting synthetic corpus spans 17 Indian languages and comprises 132,942 children's stories (over 93.9 million tokens) of simple, narrative-driven text rendered strictly in native scripts. It is intended as a foundational dataset for training small language models and for transfer learning in low-resource Indian languages.
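The summary does not say how the programmatic filtering works. As a rough illustration of what a native-script check could look like, the sketch below keeps a story only if every alphabetic character belongs to the script expected for its language. The language-to-script table, the function names, and the default threshold of 1.0 are all assumptions made for this example, not details taken from the paper.

```python
import unicodedata

# Hypothetical mapping from language code to the script name that prefixes
# Unicode character names. Only a few of the 17 languages are listed.
SCRIPT_PREFIX = {
    "hi": "DEVANAGARI",
    "bn": "BENGALI",
    "ta": "TAMIL",
    "te": "TELUGU",
}


def native_script_ratio(text: str, lang: str) -> float:
    """Return the fraction of letters in `text` written in the script of `lang`.

    Digits, punctuation, whitespace, and combining vowel signs are ignored;
    only characters for which str.isalpha() is true are counted.
    """
    prefix = SCRIPT_PREFIX[lang]
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    native = sum(1 for ch in letters if unicodedata.name(ch, "").startswith(prefix))
    return native / len(letters)


def keep_story(text: str, lang: str, threshold: float = 1.0) -> bool:
    """Assumed filter rule: keep a story whose native-script ratio meets the threshold."""
    return native_script_ratio(text, lang) >= threshold


if __name__ == "__main__":
    print(keep_story("एक था राजा।", "hi"))   # Devanagari only
    print(keep_story("एक king था", "hi"))    # mixed with Latin letters
```

A threshold below 1.0 would tolerate occasional loanwords or names in Latin script; the strict default here mirrors the "strictly in native scripts" wording but is still a guess at the actual rule.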


📝 Abstract

The development of robust language models for low-resource languages is frequently bottlenecked by the scarcity of high-quality, coherent, and domain-appropriate training corpora. In this paper, we introduce the Multilingual TinyStories dataset, a large-scale, synthetically generated collection of children's stories encompassing 17 Indian languages. Designed specifically for the training and evaluation of Small Language Models (SLMs), the corpus provides simple, narrative-driven text strictly localized to native scripts. We detail our hybrid curation pipeline, which leverages the Sarvam-M language model and a novel combinatorial prompt engineering framework for native generation, coupled with the Google Translate API for large-scale cross-lingual expansion. Through strict programmatic filtering, we compiled a release of 132,942 stories totaling over 93.9 million tokens, which serves as a foundational resource for multilingual language modeling and transfer learning in the Indic linguistic sphere.
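The abstract names a combinatorial prompt engineering framework without describing its slots. One plausible reading is that each prompt is assembled from independent story features, so that a handful of short lists yields many distinct prompts. The sketch below illustrates that idea only: the feature lists, the template wording, and `build_prompts` are hypothetical, and the call that would send each prompt to a model such as Sarvam-M is deliberately left out.

```python
import itertools
import random

# Hypothetical slot values; the paper's actual feature lists are not given here.
CHARACTERS = ["a curious rabbit", "a village potter", "a little girl"]
SETTINGS = ["a mango orchard", "a riverside village"]
THEMES = ["sharing", "honesty", "courage"]

# Hypothetical prompt template with one placeholder per slot.
TEMPLATE = (
    "Write a short children's story in {language}, using only its native script. "
    "The story features {character} in {setting} and teaches a lesson about {theme}. "
    "Use simple words and short sentences."
)


def build_prompts(language: str, n: int, seed: int = 0) -> list[str]:
    """Return up to `n` distinct prompts drawn from the full slot cross product."""
    combos = list(itertools.product(CHARACTERS, SETTINGS, THEMES))
    rng = random.Random(seed)   # seeded so the selection is reproducible
    rng.shuffle(combos)
    return [
        TEMPLATE.format(language=language, character=c, setting=s, theme=t)
        for c, s, t in combos[:n]
    ]


if __name__ == "__main__":
    for prompt in build_prompts("Hindi", 3):
        print(prompt)
```

With 3 characters, 2 settings, and 3 themes the cross product already gives 18 distinct prompts per language; adding a slot or lengthening a list multiplies that count, which is the appeal of the combinatorial approach for producing varied stories at scale.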

Problem

Research questions and friction points this paper is trying to address.

low-resource languages

training corpora

children's stories

multilingual language modeling

Indic languages

Innovation

Methods, ideas, or system contributions that make the work stand out.

synthetic corpus

combinatorial prompt engineering

small language models