🤖 AI Summary
This work addresses the scarcity of modern-domain parallel data for Hindi–Sanskrit machine translation: existing corpora predominantly draw on classical texts and lack coverage of contemporary usage. To bridge this gap, the authors present the first large-scale, systematically constructed Hindi–Sanskrit parallel corpus, comprising 92,196 sentence pairs sourced from modern materials such as spoken tutorials, children's magazines, and radio conversations. Through multi-source collection, rigorous alignment, and careful cleaning, the resulting dataset exhibits substantial lexical and semantic divergence from prior resources. Fine-tuning state-of-the-art models (ByT5, NLLB, and IndicTrans-v2) on this corpus yields significant performance gains on an in-domain test set while maintaining competitive results on general benchmarks, establishing a strong new baseline for Hindi–Sanskrit neural machine translation.
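As a concrete illustration of the fine-tuning step, the sketch below adapts NLLB to Hindi→Sanskrit translation with Hugging Face `transformers`. The paper's actual training recipe is not given here; the checkpoint, file paths, and hyperparameters are illustrative assumptions, not the authors' setup.

```python
# Minimal sketch: fine-tuning NLLB on a Hindi->Sanskrit parallel corpus.
# Checkpoint, file names, and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM, AutoTokenizer,
    DataCollatorForSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments,
)

model_name = "facebook/nllb-200-distilled-600M"
# NLLB uses FLORES-200 codes; Hindi and Sanskrit are both written in Devanagari.
tokenizer = AutoTokenizer.from_pretrained(
    model_name, src_lang="hin_Deva", tgt_lang="san_Deva"
)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Hypothetical tab-separated file with one "hindi<TAB>sanskrit" pair per line.
dataset = load_dataset(
    "csv", data_files={"train": "samasamayik_train.tsv"},
    delimiter="\t", column_names=["hi", "sa"],
)

def preprocess(batch):
    # text_target tokenizes the Sanskrit side as decoder labels.
    return tokenizer(batch["hi"], text_target=batch["sa"],
                     truncation=True, max_length=256)

tokenized = dataset.map(preprocess, batched=True, remove_columns=["hi", "sa"])

args = Seq2SeqTrainingArguments(
    output_dir="nllb-hi-sa",
    per_device_train_batch_size=16,
    learning_rate=3e-5,
    num_train_epochs=3,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```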
📝 Abstract
We release Samasāmayik, a novel, meticulously curated, large-scale Hindi–Sanskrit corpus comprising 92,196 parallel sentences. Unlike most data available in Sanskrit, which focuses on classical-era texts and poetry, this corpus aggregates data from diverse sources covering contemporary materials, including spoken tutorials, children's magazines, radio conversations, and instruction materials. We benchmark the new dataset by fine-tuning three complementary models (ByT5, NLLB, and IndicTrans-v2) to demonstrate its utility. Our experiments show that models trained on the Samasāmayik corpus achieve significant performance gains on in-domain test data while remaining competitive on other widely used test sets, establishing a strong new performance baseline for contemporary Hindi–Sanskrit translation. Furthermore, a comparative analysis against existing corpora reveals minimal semantic and lexical overlap, confirming the novelty and non-redundancy of our dataset as a robust new resource for low-resource Indic-language MT.
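The overlap analysis in the last sentence can be approximated with a simple check like the one below: Jaccard similarity between the Sanskrit-side vocabularies of two corpora. The file paths are placeholders and the paper's exact overlap metric is not specified here; this is a minimal sketch of the idea, not the authors' method.

```python
# Minimal sketch of a lexical-overlap check between two Sanskrit corpora:
# Jaccard similarity of their whitespace-token vocabularies.
# File paths are placeholders; this is not the paper's exact metric.

def vocab(path: str) -> set[str]:
    """Collect the whitespace-token vocabulary of a one-sentence-per-line file."""
    with open(path, encoding="utf-8") as f:
        return {tok for line in f for tok in line.split()}

new_corpus = vocab("samasamayik.sa.txt")      # contemporary corpus (placeholder)
old_corpus = vocab("existing_corpus.sa.txt")  # classical-text corpus (placeholder)

jaccard = len(new_corpus & old_corpus) / len(new_corpus | old_corpus)
print(f"Vocabulary Jaccard overlap: {jaccard:.3f}")
# A low score suggests the new corpus contributes largely distinct lexical material.
```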