🤖 AI Summary
Chinese text lacks explicit word boundaries, so segmentation is inherently ambiguous in ways that profoundly impact dependency parsing. This paper systematically investigates how alternative word boundary definitions affect dependency structures, using the Chinese GSD treebank. It establishes, for the first time, interpretable correlations between segmentation boundaries and both dependency arc distributions and syntactic depth. Through controlled multi-scheme experiments, we quantitatively demonstrate that boundary definitions significantly alter dependency relation distributions and tree complexity (e.g., average dependency length and maximum depth). To support fine-grained linguistic analysis, we develop an interactive visualization tool (built with D3.js and React) that enables real-time comparison of structural differences across segmentation schemes and facilitates linguistic attribution. Our findings provide theoretical grounding and empirical evidence for joint segmentation–parsing modeling, advancing scientifically rigorous evaluation of word-unit selection in Chinese NLP.
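The tree-complexity measures mentioned above (average dependency length and maximum depth) are standard treebank statistics. As a minimal sketch, not the paper's implementation, they can be computed from a sentence's 1-based head indices (CoNLL-U convention, where head 0 marks the root):

```python
def tree_metrics(heads):
    """Compute (average dependency length, maximum depth) for one sentence.

    `heads` lists the 1-based head index of each token; 0 marks the root.
    Dependency length is |dependent position - head position| over non-root
    arcs; depth is the number of arcs from a token up to the root.
    """
    # Arc lengths, skipping the root attachment.
    lengths = [abs(i - h) for i, h in enumerate(heads, start=1) if h != 0]
    avg_len = sum(lengths) / len(lengths) if lengths else 0.0

    def depth(i):
        d = 0
        while heads[i - 1] != 0:  # walk up to the root
            i = heads[i - 1]
            d += 1
        return d

    max_depth = max(depth(i) for i in range(1, len(heads) + 1))
    return avg_len, max_depth


# Toy example: a 4-token sentence whose root is token 2.
print(tree_metrics([2, 0, 2, 3]))  # → (1.0, 2)
```

Averaging these per-sentence values over a treebank gives corpus-level figures that can be compared across segmentation schemes: coarser segmentation merges tokens, which tends to shorten arcs and flatten trees.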
📝 Abstract
Chinese word segmentation is a foundational task in natural language processing (NLP), with far-reaching effects on syntactic analysis. Unlike alphabetic languages such as English, Chinese lacks explicit word boundaries, making segmentation both necessary and inherently ambiguous. This study examines the intricate relationship between word segmentation and syntactic parsing, clarifying how different segmentation strategies shape dependency structures in Chinese. Focusing on the Chinese GSD treebank, we analyze multiple word boundary schemes, each reflecting distinct linguistic and computational assumptions, and examine how they influence the resulting syntactic structures. To support detailed comparison, we introduce an interactive web-based visualization tool that displays parsing outcomes across segmentation methods.