Samasāmayik: A Parallel Dataset for Hindi-Sanskrit Machine Translation

📅 2026-03-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the scarcity of modern-domain parallel data for Hindi–Sanskrit machine translation: most existing Sanskrit corpora draw on classical-era texts and poetry and do not cover contemporary usage. To bridge this gap, the authors present Samasāmayik, a large-scale, curated Hindi–Sanskrit parallel corpus of 92,196 sentence pairs drawn from contemporary materials, including spoken tutorials, children's magazines, radio conversations, and instruction materials. A comparison with existing corpora shows minimal lexical and semantic overlap. Fine-tuning ByT5, NLLB, and IndicTrans-v2 on the corpus yields significant gains on the in-domain test set and comparable performance on other widely used test sets, establishing a strong baseline for contemporary Hindi–Sanskrit translation.

📝 Abstract
We release Samasāmayik, a novel, meticulously curated, large-scale Hindi-Sanskrit corpus comprising 92,196 parallel sentences. Unlike most available Sanskrit data, which focuses on classical-era text and poetry, this corpus aggregates contemporary materials from diverse sources, including spoken tutorials, children's magazines, radio conversations, and instruction materials. To demonstrate its utility, we benchmark the dataset by fine-tuning three complementary models: ByT5, NLLB, and IndicTrans-v2. Our experiments show that models trained on the Samasāmayik corpus achieve significant performance gains on in-domain test data while remaining comparable on other widely used test sets, establishing a strong new performance baseline for contemporary Hindi-Sanskrit translation. Furthermore, a comparative analysis against existing corpora reveals minimal semantic and lexical overlap, confirming the novelty and non-redundancy of our dataset as a robust new resource for low-resource Indic language MT.
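
The abstract does not say how lexical overlap with existing corpora was measured. As a rough illustration only, the sketch below assumes whitespace tokenization and uses two simple measures: Jaccard similarity between corpus vocabularies, and the share of sentences that appear verbatim in an older corpus. The function names and the toy sentences are hypothetical and are not taken from the paper.

```python
def vocab(sentences):
    """Collect the set of whitespace-separated tokens in a list of sentences."""
    return {tok for s in sentences for tok in s.split()}


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B|; returns 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def sentence_overlap(new_corpus, old_corpus):
    """Fraction of sentences in new_corpus that also occur verbatim in old_corpus."""
    if not new_corpus:
        return 0.0
    old = {s.strip() for s in old_corpus}
    return sum(s.strip() in old for s in new_corpus) / len(new_corpus)


if __name__ == "__main__":
    # Toy Sanskrit sentences, purely illustrative (not drawn from any dataset).
    new_side = ["अहं विद्यालयं गच्छामि", "सः पठति"]
    old_side = ["सः वनं गच्छति", "सः पठति"]
    print("vocabulary Jaccard:", jaccard(vocab(new_side), vocab(old_side)))
    print("verbatim sentence overlap:", sentence_overlap(new_side, old_side))
```

Low values on both measures would be consistent with the "minimal lexical overlap" claim. Semantic overlap would need a different tool, such as sentence embeddings, which this sketch does not cover.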

Problem

Research questions and friction points this paper is trying to address.

Hindi-Sanskrit machine translation
parallel corpus
low-resource languages
contemporary Sanskrit
Indic NLP

Innovation

Methods, ideas, or system contributions that make the work stand out.

parallel corpus
Hindi-Sanskrit machine translation
low-resource MT
contemporary Sanskrit
dataset curation
N J Karthika
Indian Institute of Technology Bombay
Keerthana Suryanarayanan
Geakminds Technologies Private Limited
Jahanvi Purohit
Indian Institute of Technology Bombay
Ganesh Ramakrishnan
Professor, Department of Computer Science and Engineering, Indian Institute of Technology Bombay
Machine Learning, Relational Learning, Information Extraction, Question Answering, Text Analytics
Jitin Singla
Indian Institute of Technology Roorkee
Anil Kumar Gourishetty
Indian Institute of Technology Roorkee