🤖 AI Summary
This study investigates diachronic morphological complexity change in Vedic Sanskrit—a highly inflectional language with scarce annotated resources—over its 2,000-year history, challenging the traditional “language evolution as simplification” hypothesis. Method: We propose a neuro-symbolic framework integrating regex-based pseudo-labeling, multilingual BERT fine-tuning, and weakly supervised learning, augmented by a confidence-weighted fusion mechanism to enable interpretable and scalable linguistic change detection. Contribution/Results: Evaluated on a 1.47-million-word diachronic corpus, our model achieves a feature detection rate of 52.4% and yields well-calibrated uncertainty estimates (Pearson’s r = 0.92; Expected Calibration Error = 0.043). Crucially, we provide the first empirical evidence that Sanskrit’s morphological complexity does not undergo unidirectional simplification but rather dynamic redistribution: compound formation and philosophical terminology exhibit marked growth, reflecting systematic recomplexification.
📝 Abstract
This study challenges the naive assumption that linguistic change amounts to simplification by quantitatively analyzing over 2,000 years of Sanskrit, demonstrating how weakly supervised hybrid neural-symbolic methods can yield significant new insights into the evolution of morphologically rich, low-resource languages. Our approach addresses data scarcity through weak supervision: 100+ high-precision regex patterns generate pseudo-labels for fine-tuning a multilingual BERT model. We then fuse symbolic and neural outputs via a novel confidence-weighted ensemble, yielding a system that is both scalable and interpretable. Applied to a 1.47-million-word diachronic corpus, the ensemble achieves a 52.4% overall feature detection rate. Our findings reveal that Sanskrit's overall morphological complexity does not decrease but is instead dynamically redistributed: while earlier verbal features show cyclical patterns of decline, complexity shifts to other domains, evidenced by a dramatic expansion in compounding and the emergence of new philosophical terminology. Critically, our system produces well-calibrated uncertainty estimates, with confidence strongly correlating with accuracy (Pearson's r = 0.92) and low overall calibration error (ECE = 0.043), bolstering the reliability of these findings for computational philology.
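The reported calibration metric, Expected Calibration Error (ECE), bins predictions by confidence and averages the gap between confidence and accuracy per bin, weighted by bin size. A minimal sketch of the standard definition follows; the bin count of 10 is an assumption, as the abstract does not specify it.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: size-weighted mean of |accuracy - mean confidence| over
    equal-width confidence bins (standard formulation; n_bins assumed)."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += mask.mean() * gap  # weight the bin by its share of predictions
    return ece

# Perfectly calibrated toy predictions: 80% confidence, 4/5 correct -> ECE = 0.
print(expected_calibration_error([0.8] * 5, [1, 1, 1, 1, 0]))
```

An ECE of 0.043, as reported, means the system's stated confidences deviate from its observed accuracy by about 4 percentage points on average.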