Mitrasamgraha: A Comprehensive Classical Sanskrit Machine Translation Dataset

📅 2026-01-12
📈 Citations: 1
Influential: 0
🤖 AI Summary
Classical Sanskrit texts are rich in poetic expression, philosophical depth, and intricate linguistic structures, yet the scarcity of high-quality, multi-domain Sanskrit–English parallel corpora spanning three millennia has long hindered machine translation research. This work introduces Mitrasamgraha, a meticulously curated dataset comprising 391,548 human-verified sentence pairs drawn from ritual, epic, philosophical, poetic, and scientific domains, featuring fine-grained temporal and domain annotations for the first time. Fine-tuning state-of-the-art models such as NLLB and Gemma on this dataset yields substantial improvements in translation performance, demonstrating its utility. Nevertheless, the study also reveals persistent challenges in accurately translating complex compounds, nuanced philosophical concepts, and layered metaphors, underscoring the need for further advances in handling Sanskrit’s linguistic and semantic complexity.

📝 Abstract
While machine translation is regarded as a "solved problem" for many high-resource languages, close analysis quickly reveals that this is not the case for content that poses challenges such as poetic language, philosophical concepts, and multi-layered metaphorical expressions. Sanskrit literature is a prime example: it combines a large number of such challenges with inherent linguistic features like sandhi, compounding, and heavy morphology, which further complicate downstream NLP tasks. It spans multiple millennia of text production as well as a broad range of domains, from ritual formulas through epic narratives, philosophical treatises, and poetic verses to scientific material. To date, publicly available resources covering these different domains and temporal layers of Sanskrit are scarce. We therefore introduce Mitrasamgraha, a high-quality Sanskrit-to-English machine translation dataset of 391,548 bitext pairs, more than four times larger than Itihāsa, the largest previously available Sanskrit dataset. It covers a time period of more than three millennia and a broad range of historical Sanskrit domains. In contrast to web-crawled datasets, its temporal and domain annotations enable fine-grained study of domain and time-period effects on MT performance. We also release a validation set of 5,587 and a test set of 5,552 post-corrected bitext pairs. We benchmark commercial and open models on this dataset and fine-tune NLLB and Gemma models on it, showing significant improvements while still recognizing significant challenges in the translation of complex compounds, philosophical concepts, and multi-layered metaphors. We also analyze how in-context learning on this dataset affects the performance of commercial models.
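The abstract mentions analyzing how in-context learning on the dataset affects commercial models. A minimal sketch of how k-shot exemplars from a bitext corpus might be assembled into a translation prompt (the function name, prompt wording, and example pairs below are hypothetical illustrations, not drawn from Mitrasamgraha or the paper's actual setup):

```python
# Sketch: assembling a k-shot in-context learning prompt for
# Sanskrit-to-English translation from (sanskrit, english) bitext pairs.

def build_icl_prompt(exemplars, source_sentence, k=3):
    """Build a few-shot translation prompt from exemplar bitext pairs."""
    lines = ["Translate the following Sanskrit sentences into English.", ""]
    for sa, en in exemplars[:k]:
        lines.append(f"Sanskrit: {sa}")
        lines.append(f"English: {en}")
        lines.append("")
    # The final source sentence is left for the model to complete.
    lines.append(f"Sanskrit: {source_sentence}")
    lines.append("English:")
    return "\n".join(lines)

# Hypothetical exemplar pairs (well-known proverbs, used only for illustration).
exemplars = [
    ("dharmo rakṣati rakṣitaḥ", "Dharma protects the one who protects it."),
    ("satyam eva jayate", "Truth alone triumphs."),
]

prompt = build_icl_prompt(exemplars, "vidyā dadāti vinayam", k=2)
print(prompt)
```

The resulting string would then be sent to a commercial model's completion or chat endpoint; retrieval strategy for the exemplars (e.g. matching by domain or time period, which the dataset's annotations would support) is a separate design choice.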
Problem

Research questions and friction points this paper addresses.

Sanskrit machine translation
poetic language
philosophical concepts
multi-layered metaphors
morphological complexity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sanskrit machine translation
multidomain dataset
temporal annotation
morphological complexity
in-context learning