Modeling and Benchmarking Spoken Dialogue Rewards with Modality and Colloquialness

📅 2026-03-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing end-to-end spoken dialogue systems struggle to model paralinguistic cues—such as prosody and emotion—and the inherent discrepancies between spoken and written language, leading to a misalignment between evaluation metrics and actual conversational quality. To address this, this work proposes SDiaReward, the first end-to-end multi-turn spoken dialogue reward model explicitly designed to capture modality-specific and orality-related gaps. Trained via pairwise preference learning directly on full-length spoken dialogue segments, SDiaReward leverages a newly curated dataset, SDiaReward-Dataset, and a hierarchical benchmark, ESDR-Bench, to enable fine-grained, end-to-end evaluation of spoken dialogues. Experimental results demonstrate that SDiaReward achieves state-of-the-art performance in preference prediction accuracy, substantially outperforming general-purpose audio large language models, while exhibiting deep understanding of expressive dialogue characteristics and strong cross-domain generalization capabilities.

📝 Abstract
The rapid evolution of end-to-end spoken dialogue systems demands transcending mere textual semantics to incorporate paralinguistic nuances and the spontaneous nature of human conversation. However, current methods struggle with two critical gaps: the modality gap, involving prosody and emotion, and the colloquialness gap, distinguishing written scripts from natural speech. To address these challenges, we introduce SDiaReward, an end-to-end multi-turn reward model trained on SDiaReward-Dataset, a novel collection of episode-level preference pairs explicitly targeting these gaps. It operates directly on full multi-turn speech episodes and is optimized with pairwise preference supervision, enabling joint assessment of modality and colloquialness in a single evaluator. We further establish ESDR-Bench, a stratified benchmark for robust episode-level evaluation. Experiments demonstrate that SDiaReward achieves state-of-the-art pairwise preference accuracy, significantly outperforming general-purpose audio LLMs. Further analysis suggests that SDiaReward captures relative conversational expressiveness beyond superficial synthesis cues, improving generalization across domains and recording conditions. Code, data, and demos are available at https://sdiareward.github.io/.
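The abstract states that SDiaReward is "optimized with pairwise preference supervision" over preferred/rejected dialogue episodes. The paper excerpt does not spell out the exact objective, but such reward models conventionally use a Bradley-Terry style loss, -log σ(r_chosen - r_rejected). The sketch below is illustrative only (the function name and scalar-reward interface are assumptions, not the authors' code):

```python
import math

def pairwise_preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry style pairwise loss (assumed form, not from the paper).

    r_chosen / r_rejected are scalar reward scores the model assigns to the
    preferred and dispreferred spoken dialogue episodes. The loss
    -log sigmoid(r_chosen - r_rejected) pushes the reward margin apart.
    """
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# With equal rewards the loss is ln(2); a larger margin in favor of the
# chosen episode drives the loss toward zero.
```

In practice the scalar rewards would come from a learned head on top of an audio encoder over the full multi-turn episode; this sketch only captures the preference objective itself.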
Problem

Research questions and friction points this paper is trying to address.

modality
colloquialness
spoken dialogue systems
paralinguistic
reward modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

spoken dialogue reward modeling
modality gap
colloquialness
pairwise preference learning
end-to-end speech evaluation
Jingyu Lu
Zhejiang University
Yuhan Wang
Unknown affiliation
Fan Zhuo
Zhejiang University
Xize Cheng
Zhejiang University
Changhao Pan
Zhejiang University
Multi-Modal Generative AI, Singing Voice Synthesis
Xueyi Pu
Zhejiang University
Yifu Chen
Zhejiang University
Chenyuhao Wen
Zhejiang University
Tianle Liang
Zhejiang University
Zhou Zhao
Zhejiang University
Machine Learning, Data Mining, Multimedia Computing