Linear Semantic Segmentation for Low-Resource Spoken Dialects

📅 2026-05-07
📈 Citations: 0
Influential: 0

🤖 AI Summary
This study addresses the poor performance of existing semantic segmentation models on low-resource spoken dialects, using dialectal Arabic as its case: informal syntax, code-switching, and weakly marked discourse structure all challenge standard approaches. The authors introduce a new multi-genre benchmark (more than 1,000 samples) for semantic segmentation of conversational Arabic, covering transcribed telephone conversations, code-switched podcasts, broadcast news, and dialogue from novels, annotated and validated by native Arabic annotators. They show that models performing well on Modern Standard Arabic (MSA) news degrade on dialectal transcribed speech, and they propose a linear segmentation model that targets local semantic coherence and robustness to discourse discontinuities. The proposed model consistently outperforms strong baselines on dialectal non-news genres. The authors state that the benchmark and approach generalize to other low-resource spoken languages.
📝 Abstract
Semantic segmentation is a core component of discourse analysis, yet existing models are primarily developed and evaluated on high-resource written text, limiting their effectiveness on low-resource spoken varieties. In particular, dialectal Arabic exhibits informal syntax, code-switching, and weakly marked discourse structure that challenge standard segmentation approaches. In this paper, we introduce a new multi-genre benchmark (more than 1,000 samples) for semantic segmentation in conversational Arabic, focusing on dialectal discourse. The benchmark covers transcribed casual telephone conversations, code-switched podcasts, broadcast news, and expressive dialogue from novels, and was annotated and validated by native Arabic annotators. Using this benchmark, we show that segmentation models performing well on Modern Standard Arabic (MSA) news genres degrade on dialectal transcribed speech. We further propose a segmentation model that targets local semantic coherence and robustness to discourse discontinuities, consistently outperforming strong baselines on dialectal non-news genres. The benchmark and approach generalize to other low-resource spoken languages.
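To make "linear segmentation by local semantic coherence" concrete, here is a minimal sketch of the general idea in the style of classic lexical-cohesion segmenters. It is not the authors' model: the function names (`gap_coherence`, `segment`), the `window` and `threshold` parameters, the whitespace tokenization, and the English toy utterances are all assumptions made for illustration. A realistic system for dialectal Arabic would likely replace the bag-of-words vectors with learned sentence representations.

```python
import math
from collections import Counter


def bag_of_words(utterances):
    """Count whitespace-separated tokens over a list of utterances (assumed tokenization)."""
    counts = Counter()
    for utterance in utterances:
        counts.update(utterance.lower().split())
    return counts


def cosine(a, b):
    """Cosine similarity between two Counter vectors; 0.0 if either is empty."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def gap_coherence(utterances, window=2):
    """Score each gap between utterances by the similarity of its left and right context."""
    scores = []
    for gap in range(1, len(utterances)):
        left = bag_of_words(utterances[max(0, gap - window):gap])
        right = bag_of_words(utterances[gap:gap + window])
        scores.append(cosine(left, right))
    return scores


def segment(utterances, window=2, threshold=0.1):
    """Return indices where a new segment starts: local coherence minima below threshold."""
    scores = gap_coherence(utterances, window)
    boundaries = []
    for i, score in enumerate(scores):
        left_ok = i == 0 or score <= scores[i - 1]
        right_ok = i == len(scores) - 1 or score <= scores[i + 1]
        if score < threshold and left_ok and right_ok:
            boundaries.append(i + 1)  # the gap at position i precedes utterance i + 1
    return boundaries


# Toy transcript with a topic shift from football to weather after the third utterance.
utterances = [
    "the match was great last night",
    "yes the match was a great game",
    "the team played a great match",
    "tomorrow rain is expected in cairo",
    "rain and wind in cairo tomorrow",
    "cairo weather means rain tomorrow",
]
print(segment(utterances))
```

The output is a flat list of boundary positions, which is what makes the segmentation "linear": the transcript is cut into consecutive, non-overlapping segments with no hierarchy. Scoring only a small window on each side of a gap is one simple way to stay robust when the wider discourse is discontinuous, as the abstract describes for spoken dialects.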
Problem

Research questions and friction points this paper is trying to address.

semantic segmentation
low-resource spoken dialects
dialectal Arabic
discourse analysis
code-switching
Innovation

Methods, ideas, or system contributions that make the work stand out.

semantic segmentation
low-resource spoken dialects
dialectal Arabic
discourse discontinuities
multi-genre benchmark