🤖 AI Summary
Chinese text lacks explicit word boundaries, so segmentation is inherently ambiguous in ways that profoundly impact dependency parsing. This paper systematically investigates how alternative word boundary definitions affect dependency structures, using the Chinese GSD treebank. It establishes, for the first time, interpretable correlations between segmentation boundaries and both dependency arc distributions and syntactic depth. Through controlled multi-scheme experiments, we quantitatively demonstrate that boundary definitions significantly alter dependency relation distributions and tree complexity (e.g., average dependency length and maximum depth). To support fine-grained linguistic analysis, we develop an interactive visualization tool (built with D3.js and React) that enables real-time comparison of structural differences across segmentation schemes and facilitates linguistic attribution. Our findings provide theoretical grounding and empirical evidence for joint segmentation–parsing modeling, advancing scientifically rigorous evaluation of word-unit selection in Chinese NLP.
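The tree-complexity measures mentioned above (average dependency length and maximum depth) are standard treebank statistics. As a minimal sketch, not the paper's implementation, they can be computed from a sentence's 1-based head indices (CoNLL-U convention, where head 0 marks the root):

```python
def tree_metrics(heads):
    """Compute (average dependency length, maximum depth) for one sentence.

    `heads` lists the 1-based head index of each token; 0 marks the root.
    Dependency length is |dependent position - head position| over non-root
    arcs; depth is the number of arcs from a token up to the root.
    """
    # Arc lengths, skipping the root attachment.
    lengths = [abs(i - h) for i, h in enumerate(heads, start=1) if h != 0]
    avg_len = sum(lengths) / len(lengths) if lengths else 0.0

    def depth(i):
        d = 0
        while heads[i - 1] != 0:  # walk up to the root
            i = heads[i - 1]
            d += 1
        return d

    max_depth = max(depth(i) for i in range(1, len(heads) + 1))
    return avg_len, max_depth


# Toy example: a 4-token sentence whose root is token 2.
print(tree_metrics([2, 0, 2, 3]))  # → (1.0, 2)
```

Averaging these per-sentence values over a treebank gives corpus-level figures that can be compared across segmentation schemes: coarser segmentation merges tokens, which tends to shorten arcs and flatten trees.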
📝 Abstract
Chinese word segmentation is a foundational task in natural language processing (NLP), with far-reaching effects on syntactic analysis. Unlike alphabetic languages such as English, Chinese lacks explicit word boundaries, making segmentation both necessary and inherently ambiguous. This study examines the intricate relationship between word segmentation and syntactic parsing, clarifying how different segmentation strategies shape dependency structures in Chinese. Focusing on the Chinese GSD treebank, we analyze multiple word boundary schemes, each reflecting distinct linguistic and computational assumptions, and examine how they influence the resulting syntactic structures. To support detailed comparison, we introduce an interactive web-based visualization tool that displays parsing outcomes across segmentation methods.