WavReward: Spoken Dialogue Models With Generalist Reward Evaluators

📅 2025-05-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current speech dialogue models (e.g., GPT-4o-audio) lack reliable evaluation methods that jointly assess cognitive and affective capabilities, as conventional text-based metrics fail to capture non-lexical information inherent in spoken interaction. To address this, we propose WavReward—the first audio-native, general-purpose reward evaluator—integrating speech-semantic joint modeling with a novel nine-dimensional acoustic attribute framework. WavReward introduces a nonlinear reward mechanism and a multi-sample reinforcement feedback paradigm. We further construct ChatReward-30K, the first multidimensional speech dialogue preference dataset. Experiments demonstrate that WavReward boosts the objective accuracy of Qwen2.5-Omni to 91.5% (+36.4 percentage points) and achieves an 83% subjective A/B win rate, significantly outperforming state-of-the-art evaluators.

📝 Abstract
End-to-end spoken dialogue models such as GPT-4o-audio have recently garnered significant attention in the speech domain. However, the evaluation of spoken dialogue models' conversational performance has largely been overlooked, primarily because intelligent chatbots convey a wealth of non-textual information that cannot be easily measured using text-based language models like ChatGPT. To address this gap, we propose WavReward, a reward feedback model based on audio language models that can evaluate both the IQ and EQ of spoken dialogue systems with speech input. Specifically, 1) building on audio language models, WavReward incorporates a deep reasoning process and a nonlinear reward mechanism for post-training. By utilizing multi-sample feedback via a reinforcement learning algorithm, we construct a specialized evaluator tailored to spoken dialogue models. 2) We introduce ChatReward-30K, a preference dataset used to train WavReward. ChatReward-30K covers both the comprehension and generation aspects of spoken dialogue models. Its scenarios span various tasks, such as text-based chats, instruction chats over nine acoustic attributes, and implicit chats. WavReward outperforms previous state-of-the-art evaluation models across multiple spoken dialogue scenarios, raising Qwen2.5-Omni's objective accuracy from 55.1% to 91.5%. In subjective A/B testing, WavReward also leads by a margin of 83%. Comprehensive ablation studies confirm the necessity of each component of WavReward. All data and code will be made publicly available at https://github.com/jishengpeng/WavReward after the paper is accepted.
Problem

Research questions and friction points this paper is trying to address.

Evaluating conversational performance of spoken dialogue models
Measuring non-textual information in spoken dialogue systems
Developing a reward model for IQ and EQ evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Audio language model for reward feedback
Multi-sample reinforcement learning evaluator
ChatReward-30K dataset for training
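The abstract mentions a nonlinear reward mechanism and multi-sample feedback but gives no formulas, so the following is a minimal, hypothetical sketch of those two ideas: averaging several stochastic evaluator judgments to reduce variance, and shaping a raw score into a bounded nonlinear reward. All function names and the exponential shaping choice are illustrative assumptions, not the paper's actual method.

```python
import math
import statistics

def aggregate_multi_sample_rewards(sample_scores):
    """Combine k independent judgments from a stochastic evaluator.

    Averaging reduces variance; the population standard deviation
    flags inputs on which the evaluator is unreliable.
    (Hypothetical sketch, not WavReward's actual aggregation.)
    """
    return statistics.mean(sample_scores), statistics.pstdev(sample_scores)

def nonlinear_reward(score, target, sharpness=4.0):
    """Map a raw score to a bounded reward in (0, 1].

    A squared-error penalty passed through an exponential concentrates
    reward mass near the target score, a common shaping choice
    (assumed here for illustration).
    """
    return math.exp(-sharpness * (score - target) ** 2)
```

For example, three sampled judgments `[0.8, 0.9, 0.7]` aggregate to a mean of 0.8, and `nonlinear_reward` rewards a score of 0.9 against a target of 1.0 far more than a score of 0.5.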
Shengpeng Ji
Zhejiang University & Alibaba Group
Tianle Liang
Zhejiang University & Alibaba Group
Yangzhuo Li
Zhejiang University & Alibaba Group
Jialong Zuo
Zhejiang University
Speech Synthesis · Voice Conversion
Minghui Fang
Zhejiang University
Speech · Multi-Modal Learning · Information Retrieval
Jinzheng He
Alibaba Qwen Team, Zhejiang University
Omni LLM · Post-Training · RL
Yifu Chen
Zhejiang University & Alibaba Group
Zhengqing Liu
Zhejiang University & Alibaba Group
Ziyue Jiang
Zhejiang University
Speech Synthesis
Xize Cheng
Zhejiang University & Alibaba Group
Siqi Zheng
Zhejiang University & Alibaba Group
Jin Xu
Alibaba Group
Junyang Lin
Qwen Team, Alibaba Group & Peking University
Natural Language Processing · Cross-Modal Representation Learning · Pretraining
Zhou Zhao
Zhejiang University
Machine Learning · Data Mining · Multimedia Computing