Semantically Cohesive Word Grouping in Indian Languages

📅 2025-01-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Indian languages are inflectional and agglutinative, and whitespace-separated words do not always align with semantic units, so parallel sentences can show apparent differences in syntactic structure across languages. Method: The paper proposes grouping words into semantically cohesive units as a preprocessing step for Indian-language NLP, and studies its effect on Hindi. Contribution/Results: Using a qualitative analysis of dependency parse structure, an intrinsic perturbation experiment (word shuffling), and an extrinsic evaluation on machine translation with decomposed prompting, the authors report that grouping brings uniformity to syntactic structures and aids downstream NLP tasks. The results suggest that the granularity of whitespace tokenization accounts for several of the apparent structural differences between languages.

📝 Abstract

Indian languages are inflectional and agglutinative and typically follow clause-free word order. The structure of sentences across most major Indian languages is similar when their dependency parse trees are considered. While some differences in the parsing structure occur due to peculiarities of a language or its preferred natural way of conveying meaning, several apparent differences are simply due to the granularity at which the smallest semantic unit of processing in a sentence is represented. The semantic unit is typically a word, typographically separated by whitespace. A single whitespace-separated word in one language may correspond to a group of words in another. Hence, grouping words based on semantics helps unify the parsing structure, and in the process the morphology, of parallel sentences across languages. In this work, we propose word grouping as a major preprocessing step for any computational or linguistic processing of sentences in Indian languages. Since Hindi is one of the least agglutinative Indian languages, we expect it to benefit the most from word grouping. Hence, in this paper, we focus on Hindi to study the effects of grouping. We assess our proposal quantitatively with an intrinsic method that perturbs sentences by shuffling words, and with an extrinsic evaluation that verifies the importance of word grouping for Machine Translation (MT) using decomposed prompting. We also qualitatively analyze certain aspects of the syntactic structure of sentences. Our experiments and analyses show that the proposed grouping technique brings uniformity to the syntactic structures and aids downstream NLP tasks.
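The contrast between word-level and group-level perturbation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the grouping below is hand-made for a romanized Hindi sentence, and the function names `shuffle_words`, `shuffle_groups`, and `groups_intact` are hypothetical, chosen only for illustration.

```python
import random


def shuffle_words(groups, rng):
    """Word-level perturbation: flatten the groups and shuffle single words."""
    words = [w for g in groups for w in g]
    rng.shuffle(words)
    return words


def shuffle_groups(groups, rng):
    """Group-level perturbation: reorder whole groups, keeping each group intact."""
    shuffled = list(groups)
    rng.shuffle(shuffled)
    return [w for g in shuffled for w in g]


def groups_intact(words, groups):
    """Return True if every group still appears as a contiguous run in `words`."""
    for g in groups:
        n = len(g)
        if not any(words[i:i + n] == list(g) for i in range(len(words) - n + 1)):
            return False
    return True


# Assumption: a hand-made grouping of the romanized Hindi sentence
# "Ram ne bazaar mein phal kharid liye" (roughly, "Ram bought fruit in the market").
# Each noun stays with its postposition and the verb with its auxiliary.
groups = [
    ["Ram", "ne"],
    ["bazaar", "mein"],
    ["phal"],
    ["kharid", "liye"],
]

rng = random.Random(0)  # seeded so the perturbation is repeatable
word_level = shuffle_words(groups, rng)
group_level = shuffle_groups(groups, rng)

print("word-level :", " ".join(word_level))
print("group-level:", " ".join(group_level))
print("groups intact after group-level shuffle:", groups_intact(group_level, groups))
```

Shuffling at the word level can separate a postposition from its noun, whereas shuffling at the group level only reorders units that are already semantically cohesive, which is the kind of difference an intrinsic perturbation experiment can probe.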

Problem

Research questions and friction points this paper is trying to address.

Semantic Agglutination

High Structural Freedom

Inconsistent Sentence Structure

Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic Clustering

Indian Languages

Machine Translation
