BhashaKritika: Building Synthetic Pretraining Data at Scale for Indic Languages

📅 2025-11-13
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address the scarcity of pretraining data for low-resource Indian languages, this paper introduces BhashaKritika, a multilingual synthetic data construction framework covering 10 Indian languages and 540 billion tokens. Methodologically, it proposes a synthetic generation paradigm that grounds generation in documents, personas, and topics, integrating five complementary generation techniques; it further designs a modular quality assurance pipeline incorporating script and language identification, metadata consistency verification, n-gram repetition analysis, and KenLM-based perplexity filtering, enabling efficient cross-script and cross-lingual quality control. Comprehensive experiments characterize the quality–diversity trade-offs across generation strategies, establishing best practices for multilingual synthetic corpus construction. Empirical results show that models pretrained on BhashaKritika achieve substantial performance gains across Indian languages, providing a reusable data infrastructure and methodological blueprint for low-resource multilingual LLM development.

📝 Abstract
In the context of pretraining of Large Language Models (LLMs), synthetic data has emerged as an alternative for generating high-quality pretraining data at scale. This is particularly beneficial in low-resource language settings where the benefits of recent LLMs have been unevenly distributed across languages. In this work, we present a systematic study on the generation and evaluation of synthetic multilingual pretraining data for Indic languages, where we construct a large-scale synthetic dataset BhashaKritika, comprising 540B tokens using 5 different techniques for 10 languages. We explore the impact of grounding generation in documents, personas, and topics. We analyze how language choice, both in the prompt instructions and document grounding, affects data quality, and we compare translations of English content with native generation in Indic languages. To support scalable and language-sensitive evaluation, we introduce a modular quality evaluation pipeline that integrates script and language detection, metadata consistency checks, n-gram repetition analysis, and perplexity-based filtering using KenLM models. Our framework enables robust quality control across diverse scripts and linguistic contexts. Empirical results through model runs reveal key trade-offs in generation strategies and highlight best practices for constructing effective multilingual corpora.
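The abstract's quality evaluation pipeline includes n-gram repetition analysis as one of its filters. A minimal sketch of such a check is shown below; the function name, the choice of trigrams, and the cut-off threshold are illustrative assumptions, not the paper's actual implementation:

```python
from collections import Counter

def ngram_repetition_ratio(text: str, n: int = 3) -> float:
    """Fraction of n-grams in `text` that are repeats of an earlier n-gram.

    Higher values indicate more repetitive (likely degenerate) generations.
    Tokenization here is naive whitespace splitting, which is only a rough
    proxy for Indic scripts; a real pipeline would use a proper tokenizer.
    """
    tokens = text.split()
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    repeats = sum(c - 1 for c in counts.values())  # occurrences beyond the first
    return repeats / len(ngrams)

def passes_repetition_filter(text: str, threshold: float = 0.3) -> bool:
    """Keep a document only if its repeated-trigram ratio is below `threshold`
    (threshold is a hypothetical value, not taken from the paper)."""
    return ngram_repetition_ratio(text) < threshold
```

In a full pipeline this check would sit alongside the script/language detector, metadata consistency checks, and a KenLM perplexity filter, with documents dropped if they fail any stage.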
Problem

Research questions and friction points this paper is trying to address.

Generating scalable synthetic pretraining data for low-resource Indic languages
Evaluating multilingual data quality across diverse scripts and linguistic contexts
Comparing translation-based and native generation approaches for Indic language corpora
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generated synthetic multilingual data using five techniques
Introduced modular quality evaluation pipeline with multiple metrics
Compared translation versus native generation in Indic languages
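The document-, persona-, and topic-grounded generation described above can be illustrated with a simple prompt-construction sketch. The function and template wording are hypothetical; the paper does not publish its exact prompts:

```python
from typing import Optional

def build_generation_prompt(
    language: str,
    persona: str,
    topic: str,
    document: Optional[str] = None,
) -> str:
    """Assemble an LLM prompt that grounds generation in a persona and topic,
    optionally conditioning on a source document (template is illustrative)."""
    parts = [
        f"You are {persona}.",
        f"Write an informative passage in {language} about {topic}.",
    ]
    if document is not None:
        parts.append("Base your passage on the following source document:")
        parts.append(document)
    return "\n".join(parts)
```

Varying the persona and topic across prompts is one way to trade off quality against diversity; grounding in a document pulls the output toward factual, native-style text rather than free-form generation.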
👥 Authors

Guduru Manoj (Krutrim, India)
Neel Prabhanjan Rachamalla (Krutrim, India)
Ashish Kulkarni (Krutrim)
Gautam Rajeev (Krutrim, India)
Jay Piplodiya (Krutrim, India)
Arul Menezes (Microsoft Research)
Shaharukh Khan (unknown affiliation)
Souvik Rana (Krutrim, India)
Manya Sah (Krutrim, India)
Chandra Khatri (Ola Krutrim AI)
Shubham Agarwal (Krutrim, India)