CharSS: Character-Level Transformer Model for Sanskrit Word Segmentation

📅 2024-07-08
🏛️ arXiv.org
📈 Citations: 1
✨ Influential: 0
🤖 AI Summary
Sandhi blurs word boundaries in Sanskrit, which makes word segmentation difficult. This paper proposes CharSS, a character-level Transformer model for Sanskrit word segmentation, and compares it against existing methods on three benchmark datasets. On the UoH+SandhiKosh dataset, it improves split prediction accuracy by 6.72 points (absolute) over the previous state-of-the-art system; on the Hackathon dataset, it improves the perfect-match metric by 2.27 points. The paper also proposes using Sanskrit-based segments for linguistically informed translation of technical terms into lexically similar low-resource Indian languages. In two separate experimental settings for this task, it reports average chrF++ gains of 8.46 and 6.79 points, respectively.

πŸ“ Abstract
Subword tokens in Indian languages inherently carry meaning, and isolating them can enhance NLP tasks, making sub-word segmentation a crucial process. Segmenting Sanskrit and other Indian languages into subtokens is not straightforward, as it may include sandhi, which may lead to changes in the word boundaries. We propose a new approach of utilizing a Character-level Transformer model for Sanskrit Word Segmentation (CharSS). We perform experiments on three benchmark datasets to compare the performance of our method against existing methods. On the UoH+SandhiKosh dataset, our method outperforms the current state-of-the-art system by an absolute gain of 6.72 points in split prediction accuracy. On the hackathon dataset, our method achieves a gain of 2.27 points over the current SOTA system in terms of perfect match metric. We also propose a use-case of Sanskrit-based segments for a linguistically informed translation of technical terms to lexically similar low-resource Indian languages. In two separate experimental settings for this task, we achieve an average improvement of 8.46 and 6.79 chrF++ scores, respectively.
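The abstract reports segmentation gains in absolute points on accuracy-style metrics but does not define them. As a rough illustration only, the sketch below assumes an exact-match style score: a prediction counts as correct when the whole predicted split equals the reference. The function name `exact_match_rate`, the `+` separator, and the sample words are hypothetical and are not taken from the paper.

```python
def exact_match_rate(predictions, references):
    """Return the percentage of predictions that equal their reference exactly.

    Assumption: each item is one segmented string such as "deva+indraḥ".
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)


# Hypothetical reference splits and model outputs (illustrative data only).
refs = ["vidyā+ālayaḥ", "deva+indraḥ", "sūrya+udayaḥ"]
preds = ["vidyā+ālayaḥ", "deva+indraḥ", "sūryo+dayaḥ"]

score = exact_match_rate(preds, refs)
print(f"exact match: {score:.2f}")
```

Under this reading, an "absolute gain of 6.72 points" is the difference between two such percentages, not a relative improvement.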
Problem

Research questions and friction points this paper is trying to address.

Translate technical terms into low-resource Indian languages
Leverage Sanskrit-based segments for linguistically informed translation
Improve accuracy using subword-level similarity and morphological alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sanskrit-based segments for informed translation
Character-level Transformer for Sanskrit segmentation
Subword-level similarity for accurate translation
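
As a rough illustration of what "character-level" means for segmentation, the sketch below converts a sandhied word and its split form into character-ID sequences, the kind of source/target pair a character-level sequence-to-sequence model could be trained on. The vocabulary construction, the special tokens, and the `+` separator are assumptions made for illustration; this summary does not describe the paper's actual preprocessing.

```python
import unicodedata

# Hypothetical special tokens; not taken from the paper.
SPECIALS = ["<pad>", "<bos>", "<eos>"]


def normalize(text):
    # NFC keeps letters such as "ā" and "ḥ" as single code points.
    return unicodedata.normalize("NFC", text)


def build_vocab(texts):
    """Map every special token and every character seen in texts to an ID."""
    chars = sorted({ch for t in texts for ch in normalize(t)})
    return {tok: i for i, tok in enumerate(SPECIALS + chars)}


def encode(text, vocab):
    """Turn a string into character IDs wrapped in <bos> ... <eos>."""
    body = [vocab[ch] for ch in normalize(text)]
    return [vocab["<bos>"]] + body + [vocab["<eos>"]]


def decode(ids, vocab):
    """Invert encode, dropping the special tokens."""
    inverse = {i: tok for tok, i in vocab.items()}
    return "".join(inverse[i] for i in ids if inverse[i] not in SPECIALS)


source = "vidyālayaḥ"    # sandhied surface form
target = "vidyā+ālayaḥ"  # split form: vidyā + ālayaḥ

vocab = build_vocab([source, target])
src_ids = encode(source, vocab)
tgt_ids = encode(target, vocab)
print(len(src_ids), len(tgt_ids), decode(tgt_ids, vocab))
```

The target here is two characters longer than the source: one for the separator and one because sandhi merged ā + ā into a single ā. This is why segmentation cannot always be done by inserting boundaries into the surface string.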