🤖 AI Summary
This work addresses the scarcity of modern-domain parallel data for Hindi–Sanskrit machine translation: existing corpora predominantly draw on classical texts and lack coverage of contemporary usage. To bridge this gap, the authors present the first large-scale, systematically constructed Hindi–Sanskrit parallel corpus, comprising 92,196 sentence pairs sourced from modern materials such as spoken tutorials, children's magazines, and radio conversations. Through multi-source collection, rigorous alignment, and careful cleaning, the resulting dataset exhibits substantial lexical and semantic divergence from prior resources. Fine-tuning state-of-the-art models (ByT5, NLLB, and IndicTrans-v2) on this corpus yields significant performance gains on an in-domain test set while maintaining competitive results on general benchmarks, establishing a strong new baseline for Hindi–Sanskrit neural machine translation.
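As a concrete illustration of the fine-tuning step, the sketch below adapts NLLB to Hindi→Sanskrit translation with Hugging Face `transformers`. The paper's actual training recipe is not given here; the checkpoint, file paths, and hyperparameters are illustrative assumptions, not the authors' setup.

```python
# Minimal sketch: fine-tuning NLLB on a Hindi->Sanskrit parallel corpus.
# Checkpoint, file names, and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM, AutoTokenizer,
    DataCollatorForSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments,
)

model_name = "facebook/nllb-200-distilled-600M"
# NLLB uses FLORES-200 codes; Hindi and Sanskrit are both written in Devanagari.
tokenizer = AutoTokenizer.from_pretrained(
    model_name, src_lang="hin_Deva", tgt_lang="san_Deva"
)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Hypothetical tab-separated file with one "hindi<TAB>sanskrit" pair per line.
dataset = load_dataset(
    "csv", data_files={"train": "samasamayik_train.tsv"},
    delimiter="\t", column_names=["hi", "sa"],
)

def preprocess(batch):
    # text_target tokenizes the Sanskrit side as decoder labels.
    return tokenizer(batch["hi"], text_target=batch["sa"],
                     truncation=True, max_length=256)

tokenized = dataset.map(preprocess, batched=True, remove_columns=["hi", "sa"])

args = Seq2SeqTrainingArguments(
    output_dir="nllb-hi-sa",
    per_device_train_batch_size=16,
    learning_rate=3e-5,
    num_train_epochs=3,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```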
📝 Abstract
We release Samasāmayik, a novel, meticulously curated, large-scale Hindi–Sanskrit corpus comprising 92,196 parallel sentences. Unlike most data available in Sanskrit, which focuses on classical-era texts and poetry, this corpus aggregates data from diverse sources covering contemporary materials, including spoken tutorials, children's magazines, radio conversations, and instruction materials. We benchmark the new dataset by fine-tuning three complementary models (ByT5, NLLB, and IndicTrans-v2) to demonstrate its utility. Our experiments show that models trained on the Samasāmayik corpus achieve significant performance gains on in-domain test data while remaining competitive on other widely used test sets, establishing a strong new performance baseline for contemporary Hindi–Sanskrit translation. Furthermore, a comparative analysis against existing corpora reveals minimal semantic and lexical overlap, confirming the novelty and non-redundancy of our dataset as a robust new resource for low-resource Indic-language MT.
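The overlap analysis in the last sentence can be approximated with a simple check like the one below: Jaccard similarity between the Sanskrit-side vocabularies of two corpora. The file paths are placeholders and the paper's exact overlap metric is not specified here; this is a minimal sketch of the idea, not the authors' method.

```python
# Minimal sketch of a lexical-overlap check between two Sanskrit corpora:
# Jaccard similarity of their whitespace-token vocabularies.
# File paths are placeholders; this is not the paper's exact metric.

def vocab(path: str) -> set[str]:
    """Collect the whitespace-token vocabulary of a one-sentence-per-line file."""
    with open(path, encoding="utf-8") as f:
        return {tok for line in f for tok in line.split()}

new_corpus = vocab("samasamayik.sa.txt")      # contemporary corpus (placeholder)
old_corpus = vocab("existing_corpus.sa.txt")  # classical-text corpus (placeholder)

jaccard = len(new_corpus & old_corpus) / len(new_corpus | old_corpus)
print(f"Vocabulary Jaccard overlap: {jaccard:.3f}")
# A low score suggests the new corpus contributes largely distinct lexical material.
```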