INDIC DIALECT: A Multi Task Benchmark to Evaluate and Translate in Indian Language Dialects

📅 2026-01-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the scarcity of data and evaluation benchmarks for low-resource Indian dialects, which has limited the ability of NLP models to understand and translate these varieties. The authors present a human-curated parallel corpus of 13,000 sentence pairs covering 11 dialects of Hindi and Odia, together with a multi-task benchmark built on it that comprises dialect classification, multiple-choice question answering, and machine translation. Fine-tuned Transformer models pretrained on Indian languages reach a dialect classification F1 of 89.8%, a 70.2-percentage-point improvement over the 19.6% baseline. For translation, a hybrid AI model achieves a BLEU score of 61.32 for dialect-to-language translation (baseline 23.36), and a "rule-based followed by AI" approach achieves 48.44 for language-to-dialect translation (baseline 27.59).

📝 Abstract

Recent NLP advances focus primarily on standardized languages, leaving most low-resource dialects underserved, especially in the Indian context. Although Hindi is the third most spoken language globally (over 600 million speakers), its numerous dialects remain underrepresented. The situation is similar for Odia, which has around 45 million speakers. While some datasets cover standard Hindi and Odia, their regional dialects have almost no web presence. We introduce INDIC-DIALECT, a human-curated parallel corpus of 13k sentence pairs spanning 11 dialects and 2 languages: Hindi and Odia. Using this corpus, we construct a multi-task benchmark with three tasks: dialect classification, multiple-choice question (MCQ) answering, and machine translation (MT). Our experiments show that LLMs like GPT-4o and Gemini 2.5 perform poorly on the classification task, while fine-tuned transformer-based models pretrained on Indian languages substantially improve performance, e.g., raising F1 from 19.6% to 89.8% on dialect classification. For dialect-to-language translation, we find that a hybrid AI model achieves the highest BLEU score of 61.32, compared to the baseline score of 23.36. Interestingly, due to the complexity of generating dialect sentences, we observe that for language-to-dialect translation the "rule-based followed by AI" approach achieves the best BLEU score of 48.44, compared to the baseline score of 27.59. INDIC-DIALECT is thus a new benchmark for dialect-aware Indic NLP, and we plan to release it as open source to support further work on low-resource Indian dialects.
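
As a rough aid to reading the numbers above, the reported gains can be recomputed from the figures in the abstract. This is only an arithmetic sketch: the dictionary keys and the helper name `gains` are illustrative choices, not names from the paper, and the scores are copied from the abstract rather than reproduced experimentally.

```python
# Scores as reported in the abstract: (baseline, best system).
# Keys are illustrative labels, not identifiers from the paper.
reported = {
    "dialect classification (F1, %)": (19.6, 89.8),
    "dialect-to-language MT (BLEU)": (23.36, 61.32),
    "language-to-dialect MT (BLEU)": (27.59, 48.44),
}


def gains(baseline: float, best: float) -> tuple[float, float]:
    """Return (absolute gain in points, ratio of best score to baseline)."""
    return round(best - baseline, 2), best / baseline


for task, (baseline, best) in reported.items():
    absolute, ratio = gains(baseline, best)
    print(f"{task}: +{absolute} points, about {ratio:.1f}x the baseline")
```

The absolute gain on classification is the 70.2-point improvement quoted for F1. The gap is smaller for language-to-dialect translation, in line with the abstract's remark that generating dialect sentences is the harder direction.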

Problem

Research questions and friction points this paper is trying to address.

low-resource dialects

Indian languages

dialect underrepresentation

NLP benchmark

dialect translation

Innovation

Methods, ideas, or system contributions that make the work stand out.

dialect-aware NLP

low-resource languages

multi-task benchmark