🤖 AI Summary
Addressing the translation challenges posed by script diversity, low-resource conditions, and grammatical heterogeneity across 36 languages of the Indian subcontinent, this work introduces the first unified framework for parallel corpus construction and neural machine translation covering all pairwise language combinations. Methodologically, it integrates script normalization, cross-lingual synthetic data augmentation, domain-adaptive neural machine translation, and discourse-level modeling, complemented by a multidimensional evaluation suite incorporating both reference-based and reference-free metrics. The project delivers the largest multiscript parallel corpus to date for these languages, yielding substantial improvements in translation quality for prototypical low-resource languages such as Khasi and Santali. It achieves state-of-the-art performance on both general-domain and domain-specific benchmarks, establishing a scalable technical paradigm for machine translation of highly diverse, multiscript language families.
📝 Abstract
This paper focuses on developing translation models and related applications for 36 Indian languages, including Assamese, Awadhi, Bengali, Bhojpuri, Braj, Bodo, Dogri, English, Konkani, Gondi, Gujarati, Hindi, Hinglish, Ho, Kannada, Kangri, Kashmiri (Arabic and Devanagari), Khasi, Mizo, Magahi, Maithili, Malayalam, Marathi, Manipuri (Bengali and Meitei), Nepali, Oriya, Punjabi, Sanskrit, Santali, Sinhala, Sindhi (Arabic and Devanagari), Tamil, Tulu, Telugu, and Urdu. Achieving this requires parallel and other types of corpora for all 36 × 36 language pairs, and it means addressing challenges such as script variation, phonetic differences, and syntactic diversity. For instance, languages like Kashmiri and Sindhi, which use multiple scripts, demand script normalization for alignment, while low-resource languages such as Khasi and Santali require synthetic data augmentation to ensure sufficient coverage and quality. To address these challenges, this work proposes strategies for corpus creation by leveraging existing resources, developing parallel datasets, generating domain-specific corpora, and utilizing synthetic data techniques. Additionally, it evaluates machine translation across various dimensions, including standard and discourse-level translation, domain-specific translation, reference-based and reference-free evaluation, error analysis, and automatic post-editing. By integrating these elements, the study establishes a comprehensive framework to improve machine translation quality and enable better cross-lingual communication in India's linguistically diverse ecosystem.
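To make the scale of the "36 × 36" requirement and the multi-script issue concrete, the following is a minimal Python sketch, not the authors' pipeline. The helper names `count_pairs` and `dominant_script` are hypothetical, and reading a script label off the first word of a character's Unicode name is a simplifying assumption that may not suit every script. The abstract does not say whether the 36 × 36 grid counts both translation directions or the same-language diagonal, so the sketch reports each variant.

```python
from collections import Counter
import unicodedata


def count_pairs(n_languages: int) -> dict:
    """Return the pair counts implied by an n x n language grid."""
    return {
        # Every cell of the grid, including same-language cells on the diagonal.
        "grid_cells": n_languages * n_languages,
        # Source -> target pairs where source and target differ.
        "directed_pairs": n_languages * (n_languages - 1),
        # Pairs where direction is ignored.
        "undirected_pairs": n_languages * (n_languages - 1) // 2,
    }


def dominant_script(text: str) -> str:
    """Guess the dominant script of a string (illustrative heuristic only).

    Assumption: the first word of a character's Unicode name
    (e.g. 'DEVANAGARI', 'ARABIC', 'LATIN') is used as its script label.
    """
    counts = Counter()
    for ch in text:
        # Count letters and combining marks; skip digits, spaces, punctuation.
        if ch.isalpha() or unicodedata.category(ch).startswith("M"):
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"


if __name__ == "__main__":
    print(count_pairs(36))
    # The same place name written in two of the scripts used for Kashmiri.
    print(dominant_script("कश्मीर"))
    print(dominant_script("کشمیر"))
```

For 36 languages this gives 1,296 grid cells, 1,260 directed pairs, and 630 undirected pairs. A script check of this kind could be used to route text to a script-specific normalizer before alignment, which is one plausible reading of the script normalization step mentioned for Kashmiri and Sindhi.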