🤖 AI Summary
This work addresses the automatic segmentation of text into Elementary Discourse Units (EDUs), a fundamental step in discourse parsing. We propose a lightweight, feature-driven random forest classifier that relies solely on shallow linguistic features, namely lexical cues and character-level n-grams, to predict EDU boundaries directly, without complex neural architectures. In the reported experiments, the method outperforms other approaches to EDU segmentation and improves downstream discourse parsing accuracy for leading parsers, including DRParser and SegBot. Our core contribution is empirical evidence that low-complexity features and models can be highly effective in discourse analysis, challenging the prevailing reliance on deep learning. This finding points toward efficient, resource-conscious discourse parsing that maintains competitive accuracy, which is particularly valuable in low-resource or latency-sensitive applications.
📝 Abstract
Segmenting text into Elementary Discourse Units (EDUs) is a fundamental task in discourse parsing. We present a simple new method for identifying EDU boundaries, and hence segmenting text into EDUs, based on lexical and character n-gram features, using random forest classification. We show that the method, despite its simplicity, outperforms other methods both for segmentation and within a state-of-the-art discourse parser. This indicates the importance of such features for identifying basic discourse elements, pointing towards potentially more training-efficient methods for discourse analysis.
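
To make the feature idea concrete, here is a minimal sketch of how lexical and character n-gram features might be extracted for each candidate boundary. It is an illustration under stated assumptions, not the authors' implementation: the function names `char_ngrams` and `boundary_features`, the one-token context window, the n-gram range of 2 to 3, and the example sentence are all hypothetical choices that the abstract does not specify.

```python
from typing import Dict, List


def char_ngrams(token: str, n_min: int = 2, n_max: int = 3) -> List[str]:
    """Return the character n-grams of a token for n in [n_min, n_max].

    The token is lowercased and wrapped in '<' and '>' so that n-grams
    at the start and end of the word are distinguishable (an assumption).
    """
    padded = f"<{token.lower()}>"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def boundary_features(tokens: List[str], i: int) -> Dict[str, int]:
    """Binary features for the candidate boundary after tokens[i].

    Uses only the token to the left and the token to the right of the
    candidate boundary (a hypothetical one-token window).
    """
    left = tokens[i]
    right = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    # Lexical features: the identity of the neighbouring tokens.
    feats = {f"left={left.lower()}": 1, f"right={right.lower()}": 1}
    # Character n-gram features for both neighbours.
    for gram in char_ngrams(left):
        feats[f"left_ng={gram}"] = 1
    for gram in char_ngrams(right):
        feats[f"right_ng={gram}"] = 1
    return feats


# Hypothetical example sentence; one feature dict per candidate boundary.
tokens = "He left early , because he was tired .".split()
features = [boundary_features(tokens, i) for i in range(len(tokens))]
```

Feature dictionaries of this shape could then be vectorized and passed, together with gold boundary labels, to a random forest implementation (for example, scikit-learn's `DictVectorizer` followed by `RandomForestClassifier`). The training setup and hyperparameters are not given in the abstract and are left open here.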