Mahānāma: A Unique Testbed for Literary Entity Discovery and Linking

📅 2025-09-24

🤖 AI Summary

Problem: Sanskrit literary texts pose significant challenges for entity resolution, including rich morphological variation, referential ambiguity, and long-distance dependencies. As a result, existing coreference resolution and entity linking models underperform on complex epics such as the *Mahābhārata*. Method: We introduce the first large-scale dataset for end-to-end Entity Discovery and Linking (EDL) in Sanskrit literature, covering over 109,000 named entity mentions from the *Mahābhārata*, mapped to 5,500 unique entities and aligned with an English knowledge base to enable cross-lingual linking. We further propose the first literary entity linking benchmark tailored to morphologically complex, low-resource languages. Contribution/Results: Using a narrative-structure-driven annotation schema, our empirical evaluation shows that state-of-the-art models struggle with global context modeling and high-ambiguity scenarios, which confirms the dataset’s value as a challenging benchmark. This work provides a foundational resource and a new evaluation platform for computational literary studies and cross-lingual entity understanding.

📝 Abstract

High lexical variation, ambiguous references, and long-range dependencies make entity resolution in literary texts particularly challenging. We present Mahānāma, the first large-scale dataset for end-to-end Entity Discovery and Linking (EDL) in Sanskrit, a morphologically rich and under-resourced language. Derived from the Mahābhārata, the world's longest epic, the dataset comprises over 109K named entity mentions mapped to 5.5K unique entities, and is aligned with an English knowledge base to support cross-lingual linking. The complex narrative structure of Mahānāma, coupled with extensive name variation and ambiguity, poses significant challenges to resolution systems. Our evaluation reveals that current coreference and entity linking models struggle when evaluated on the global context of the test set. These results highlight the limitations of current approaches in resolving entities within such complex discourse. Mahānāma thus provides a unique benchmark for advancing entity resolution, especially in literary domains.

Problem

Research questions and friction points this paper is trying to address.

Addressing entity resolution challenges in literary texts with complex narrative structures

Developing EDL methods for Sanskrit, a morphologically rich under-resourced language

Resolving name variation and ambiguity in long literary works such as the Mahābhārata

Innovation

Methods, ideas, or system contributions that make the work stand out.

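Created a large-scale Sanskrit dataset for end-to-end EDL, with over 109K mentions mapped to 5.5K unique entities

Aligned entities with an English knowledge base to support cross-lingual linking

Evaluated current coreference and entity linking models on the dataset's complex narrative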
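
To make the name-variation and ambiguity problem concrete, here is a minimal Python sketch of how mentions linked to entities might be represented and scored. It is not the dataset's actual schema or the paper's evaluation code: the `Mention` fields, the entity ids such as `E_ARJUNA`, and the four toy mentions are all hypothetical and invented for illustration, and plain mention-level accuracy stands in for whatever metrics the authors report.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """One annotated mention; every field name here is hypothetical."""
    surface: str    # name form as it appears in the text
    entity_id: str  # gold entity the mention refers to
    position: int   # token offset, used only to order mentions


def group_by_entity(mentions):
    """Collect the distinct surface forms observed for each entity."""
    forms = defaultdict(set)
    for m in mentions:
        forms[m.entity_id].add(m.surface)
    return dict(forms)


def ambiguous_surfaces(mentions):
    """Return surface forms that refer to more than one entity."""
    targets = defaultdict(set)
    for m in mentions:
        targets[m.surface].add(m.entity_id)
    return {s: ids for s, ids in targets.items() if len(ids) > 1}


def linking_accuracy(gold, predicted):
    """Fraction of mentions whose predicted entity id matches the gold id."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must align one to one")
    if not gold:
        return 0.0
    correct = sum(g.entity_id == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Toy data (assumed, not taken from the dataset): one entity with several
# names, and one name shared by two entities.
mentions = [
    Mention("arjuna", "E_ARJUNA", 0),
    Mention("pārtha", "E_ARJUNA", 5),
    Mention("pārtha", "E_YUDHISTHIRA", 12),
    Mention("dhanaṃjaya", "E_ARJUNA", 20),
]

# A naive system that always links to the most frequent entity.
naive_predictions = ["E_ARJUNA"] * len(mentions)

print(group_by_entity(mentions))
print(ambiguous_surfaces(mentions))
print(linking_accuracy(mentions, naive_predictions))
```

The naive baseline gets the shared name wrong whenever it refers to the less frequent entity, which is the kind of error that only surrounding narrative context can resolve.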

🔎 Similar Papers

LLMAEL: Large Language Models are Good Context Augmenters for Entity Linking