CorIL: Towards Enriching Indian Language to Indian Language Parallel Corpora and Machine Translation Systems

📅 2025-09-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
High-quality, cross-domain parallel corpora for Indian languages are severely scarce, hindering multilingual machine translation (MT) research and deployment. To address this, we introduce CorIL, a large-scale, domain-annotated (Government, Health, General) parallel corpus covering 11 languages and comprising 772K sentence pairs. All data undergo rigorous human verification and systematic cleaning to ensure high fidelity. We establish a unified benchmark on leading neural MT models, including IndicTrans2, NLLB, and BhashaVerse, enabling reproducible evaluation. Our empirical analysis quantitatively reveals a significant performance gap between Perso-Arabic scripts (e.g., Urdu, Sindhi) and Indic scripts (e.g., Devanagari) in multilingual modeling, underscoring both the corpus's inherent difficulty and its practical utility for domain-aware translation and cross-script transfer studies.

📝 Abstract
India's linguistic landscape is one of the most diverse in the world, comprising over 120 major languages and approximately 1,600 additional languages, with 22 officially recognized as scheduled languages in the Indian Constitution. Despite recent progress in multilingual neural machine translation (NMT), high-quality parallel corpora for Indian languages remain scarce, especially across varied domains. In this paper, we introduce CorIL, a large-scale, high-quality annotated parallel corpus covering 11 of these languages: English, Telugu, Hindi, Punjabi, Odia, Kashmiri, Sindhi, Dogri, Kannada, Urdu, and Gujarati, comprising a total of 772,000 bi-text sentence pairs. The dataset is carefully curated and systematically categorized into three key domains: Government, Health, and General, to enable domain-aware machine translation research and facilitate effective domain adaptation. To demonstrate the utility of CorIL and establish strong benchmarks for future research, we fine-tune and evaluate several state-of-the-art NMT models, including IndicTrans2, NLLB, and BhashaVerse. Our analysis reveals important performance trends and highlights the corpus's value in probing model capabilities. For instance, the results show distinct performance patterns based on language script, with massively multilingual models showing an advantage on Perso-Arabic scripts (Urdu, Sindhi) while other models excel on Indic scripts. This paper provides a detailed domain-wise performance analysis, offering insights into domain sensitivity and cross-script transfer learning. By publicly releasing CorIL, we aim to significantly improve the availability of high-quality training data for Indian languages and provide a valuable resource for the machine translation research community.
Problem

Research questions and friction points this paper is trying to address.

Addressing scarce high-quality parallel corpora for Indian languages
Enabling domain-aware machine translation across diverse Indian languages
Establishing benchmarks for cross-script transfer learning analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Created large-scale parallel corpus for 11 Indian languages
Categorized data into Government, Health, and General domains
Fine-tuned state-of-the-art NMT models for benchmarking
👥 Authors
Soham Bhattacharjee – Department of Computer Science and Engineering, Indian Institute of Technology Patna
Mukund K Roy – SNLP Lab, CDAC Noida
Yathish Poojary – Department of Computer Science and Engineering, Manipal Institute of Technology
Bhargav Dave – PhD Scholar, Dhirubhai Ambani University
Mihir Raj – Department of CSE, IIIT Bhubaneshwar
Vandan Mujadia – IIIT-Hyderabad
Baban Gain – Indian Institute of Technology Patna
Pruthwik Mishra – SVNIT, Surat
Arafat Ahsan – LTRC, IIIT Hyderabad
Parameswari Krishnamurthy – LTRC, IIIT Hyderabad
Ashwath Rao – Department of Computer Science and Engineering, Manipal Institute of Technology
Gurpreet Singh Josan – Department of CSE, Punjabi University
Preeti Dubey – Department of CSE, Govt. College for Women Jammu
Aadil Amin Kak – Department of Linguistics, University of Kashmir
Anna Rao Kulkarni – VLSI Design Group, CDAC Bangalore
Narendra VG – Department of Computer Science and Engineering, Manipal Institute of Technology
Sunita Arora – SNLP Lab, CDAC Noida
Rakesh Balbantray – Department of CSE, IIIT Bhubaneshwar
Prasenjit Majumdar – Department of Computer Science and Engineering, Dhirubhai Ambani University, Gandhinagar
Karunesh K Arora – SNLP Lab, CDAC Noida
Asif Ekbal – Department of Computer Science and Engineering, IIT Patna
Dipti Mishra Sharma – LTRC, IIIT Hyderabad