🤖 AI Summary
Existing end-to-end spoken dialogue systems struggle to model paralinguistic cues such as prosody and emotion, as well as the inherent discrepancies between spoken and written language, leading to a misalignment between evaluation metrics and actual conversational quality. To address this, this work proposes SDiaReward, the first end-to-end multi-turn spoken dialogue reward model explicitly designed to capture modality-specific and orality-related gaps. Trained via pairwise preference learning directly on full-length spoken dialogue episodes, SDiaReward leverages a newly curated dataset, SDiaReward-Dataset, and a hierarchical benchmark, ESDR-Bench, to enable fine-grained, end-to-end evaluation of spoken dialogues. Experimental results show that SDiaReward achieves state-of-the-art preference prediction accuracy, substantially outperforming general-purpose audio large language models, while exhibiting a deep understanding of expressive dialogue characteristics and strong cross-domain generalization.
📝 Abstract
The rapid evolution of end-to-end spoken dialogue systems demands transcending mere textual semantics to incorporate paralinguistic nuances and the spontaneous nature of human conversation. However, current methods struggle with two critical gaps: the modality gap, involving prosody and emotion, and the colloquialness gap, distinguishing written scripts from natural speech. To address these challenges, we introduce SDiaReward, an end-to-end multi-turn reward model trained on SDiaReward-Dataset, a novel collection of episode-level preference pairs explicitly targeting these gaps. It operates directly on full multi-turn speech episodes and is optimized with pairwise preference supervision, enabling joint assessment of modality and colloquialness in a single evaluator. We further establish ESDR-Bench, a stratified benchmark for robust episode-level evaluation. Experiments demonstrate that SDiaReward achieves state-of-the-art pairwise preference accuracy, significantly outperforming general-purpose audio LLMs. Further analysis suggests that SDiaReward captures relative conversational expressiveness beyond superficial synthesis cues, improving generalization across domains and recording conditions. Code, data, and demos are available at https://sdiareward.github.io/.
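The pairwise preference supervision described in the abstract typically corresponds to a Bradley-Terry style objective: the reward model scores the preferred and rejected episodes, and the loss penalizes pairs where the preferred score does not exceed the rejected one. The following is a minimal sketch of that loss, not the paper's actual implementation; the function name and scalar-score interface are assumptions for illustration:

```python
import math

def pairwise_preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss for one preference pair:
    -log(sigmoid(r_chosen - r_rejected)).

    r_chosen / r_rejected: scalar reward scores a model assigns to the
    human-preferred and dispreferred spoken dialogue episodes.
    The loss shrinks as the margin (r_chosen - r_rejected) grows.
    """
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy scores (hypothetical): a wider margin yields a smaller loss.
loss_small_margin = pairwise_preference_loss(1.0, 0.5)
loss_large_margin = pairwise_preference_loss(3.0, 0.5)
```

In practice the two scores would come from a single forward pass of the reward model over each full multi-turn speech episode, and the loss would be averaged over a batch of preference pairs.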