The role of synthetic data in Multilingual, Multi-cultural AI systems: Lessons from Indic Languages

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Addressing the challenge of building culturally adapted AI systems for low-resource languages, this paper proposes a bottom-up, culture-context-aware synthetic data paradigm—bypassing conventional reliance on high-resource language translation. Leveraging large open-source LLMs (≥235B parameters) and Indian-language Wikipedia content, we construct Updesh, a multilingual instruction dataset comprising 9.5 million samples across 13 Indian languages, supporting long-context and multi-turn dialogue tasks. Data quality is ensured via integrated prompt engineering, automated evaluation, and human annotation. Empirical evaluation across 15 multilingual downstream tasks demonstrates that Updesh significantly improves generative performance for low-resource languages while maintaining competitiveness on selection-based NLU tasks, thereby effectively narrowing the performance gap with high-resource languages.

📝 Abstract
Developing AI systems that operate effectively across languages while remaining culturally grounded is a long-standing challenge, particularly in low-resource settings. Synthetic data provides a promising avenue, yet its effectiveness in multilingual and multicultural contexts remains underexplored. We investigate the creation and impact of synthetic, culturally contextualized datasets for Indian languages through a bottom-up generation strategy that prompts large open-source LLMs (>= 235B parameters) to ground data generation in language-specific Wikipedia content. This approach complements the dominant top-down paradigm of translating synthetic datasets from high-resource languages such as English. We introduce Updesh, a high-quality large-scale synthetic instruction-following dataset comprising 9.5M data points across 13 Indian languages, encompassing diverse reasoning and generative tasks with an emphasis on long-context, multi-turn capabilities, and alignment with Indian cultural contexts. A comprehensive evaluation incorporating both automated metrics and human annotation across 10k assessments indicates that the generated data is high quality, though human evaluation highlights areas for further improvement. Additionally, we perform downstream evaluations by fine-tuning models on our dataset and assessing performance across 15 diverse multilingual datasets. Models trained on Updesh consistently achieve significant gains on generative tasks and remain competitive on multiple-choice style NLU tasks. Notably, relative improvements are most pronounced in low- and medium-resource languages, narrowing their gap with high-resource languages. These findings provide empirical evidence that effective multilingual AI requires multi-faceted data curation and generation strategies that incorporate context-aware, culturally grounded methodologies.
Problem

Research questions and friction points this paper is trying to address.

Developing effective multilingual AI systems for low-resource languages
Exploring synthetic data effectiveness in multicultural AI contexts
Addressing cultural grounding challenges in language AI systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Grounds bottom-up LLM prompting in language-specific Wikipedia content
Creates culturally contextualized synthetic datasets
Evaluates with automated metrics and human annotation
Pranjal A. Chitale
Microsoft Corporation
Varun Gumma
Nanyang Technological University
Sanchit Ahuja
Northeastern University
Prashant Kodali
IIIT Hyderabad
NLP, Multilingual NLP, Code-switching
Manan Uppadhyay
Microsoft Corporation
Deepthi Sudharsan
Independent Researcher
Sunayana Sitaram
Microsoft Research India
Multilingual NLP, evaluation, LLMs and culture, multilingualism, LLMs