CharSS: Character-Level Transformer Model for Sanskrit Word Segmentation

📅 2024-07-08

🏛️ arXiv.org

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

Sandhi makes word boundaries in Sanskrit ambiguous, and technical term translation into low-resource Indian languages lacks high-quality segmentation support. To address both problems, this paper proposes CharSS, a character-level Transformer model for Sanskrit word segmentation, and applies the resulting Sanskrit-based segments to linguistically informed translation of technical terms into lexically similar low-resource Indian languages. On the UoH+SandhiKosh dataset, CharSS improves split prediction accuracy over the previous state of the art by 6.72 points absolute; on the Hackathon dataset, it improves the perfect-match metric by 2.27 points. For technical term translation, evaluated with chrF++, it achieves average improvements of 8.46 and 6.79 points in two separate experimental settings.
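To make the sandhi problem concrete, here is a minimal Python sketch of why a fused Sanskrit form cannot simply be cut at a character position. It is not the paper's model: the rule table `SANDHI_RULES` and the helpers `join_with_sandhi` and `candidate_splits` are hypothetical names, and the table covers only three well-known vowel sandhi rules in IAST transliteration.

```python
import unicodedata

# Toy subset of vowel sandhi rules (IAST). Illustrative only, not exhaustive.
SANDHI_RULES = {
    ("a", "ā"): "ā",  # deva + ālaya -> devālaya
    ("a", "u"): "o",  # sūrya + udaya -> sūryodaya
    ("a", "i"): "e",  # nara + indra -> narendra
}


def nfc(text):
    """Normalize so each accented letter is a single code point."""
    return unicodedata.normalize("NFC", text)


def join_with_sandhi(left, right):
    """Join two words, fusing the boundary vowels when a toy rule applies."""
    left, right = nfc(left), nfc(right)
    merged = SANDHI_RULES.get((left[-1], right[0]))
    if merged is None:
        return left + right
    return left[:-1] + merged + right[1:]


def candidate_splits(word):
    """List every (left, right) pair the toy rules could have fused into `word`."""
    word = nfc(word)
    candidates = []
    for i in range(1, len(word) - 1):
        for (l_end, r_start), merged in SANDHI_RULES.items():
            if word[i] == merged:
                candidates.append((word[:i] + l_end, r_start + word[i + 1:]))
    return candidates


print(join_with_sandhi("nara", "indra"))
print(candidate_splits("devālaya"))
```

Even with three rules, `candidate_splits("devālaya")` returns more than one candidate, only one of which is the intended split. Resolving that kind of ambiguity from context is what a learned character-level segmenter is meant to do.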

📝 Abstract

Subword tokens in Indian languages inherently carry meaning, and isolating them can enhance NLP tasks, making sub-word segmentation a crucial process. Segmenting Sanskrit and other Indian languages into subtokens is not straightforward, because words may be joined by sandhi, which can change the word boundaries. We propose a new approach that uses a Character-level Transformer model for Sanskrit Word Segmentation (CharSS). We perform experiments on three benchmark datasets to compare the performance of our method against existing methods. On the UoH+SandhiKosh dataset, our method outperforms the current state-of-the-art system by an absolute gain of 6.72 points in split prediction accuracy. On the Hackathon dataset, our method achieves a gain of 2.27 points over the current SOTA system on the perfect match metric. We also propose a use case of Sanskrit-based segments for linguistically informed translation of technical terms into lexically similar low-resource Indian languages. In two separate experimental settings for this task, we achieve average improvements of 8.46 and 6.79 chrF++ points, respectively.
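The abstract does not define the perfect match metric. A common reading, assumed here, is the percentage of inputs whose predicted segmentation equals the gold segmentation exactly. The sketch below follows that assumption; `perfect_match_rate` is a hypothetical helper and the example strings are made up for illustration, not taken from the benchmark datasets.

```python
def perfect_match_rate(predictions, references):
    """Percentage of examples whose predicted segments equal the gold segments.

    Assumes each segmentation is a string with segments separated by whitespace.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    # An example counts only if every segment matches, in order.
    hits = sum(p.split() == r.split() for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)


predicted = ["deva ālaya", "nara indra", "sūrya udaya", "tat eva"]
gold = ["deva ālaya", "nara indra", "sūryodaya", "tat eva"]
print(perfect_match_rate(predicted, gold))  # 75.0
```

Under this reading the metric is strict: a single wrong split anywhere in an input makes the whole input count as a miss, so a gain of a few points corresponds to fully correct outputs on additional inputs.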

Problem

Research questions and friction points this paper is trying to address.

Translate technical terms into low-resource Indian languages

Leverage Sanskrit-based segments for linguistically informed translation

Improve accuracy using subword-level similarity and morphological alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Sanskrit-based segments for informed translation

Character-level Transformer for Sanskrit segmentation

Subword-level similarity for accurate translation

🔎 Similar Papers

Unsupervised Morphological Tree Tokenizer