WavBench: Benchmarking Reasoning, Colloquialism, and Paralinguistics for End-to-End Spoken Dialogue Models

📅 2026-02-12
🤖 AI Summary
Current evaluation practices for spoken dialogue systems largely rely on text generation metrics, which fail to capture the unique challenges posed by reasoning capabilities, colloquial expression, and paralinguistic features such as intonation and pauses. To address this gap, this work proposes WavBench, a novel benchmark that introduces a three-dimensional evaluation framework encompassing complex reasoning (Pro), natural spoken interaction (Basic), and paralinguistic understanding and generation (Acoustic), with an emphasis on audibility and authentic conversational contexts. Through the construction of multidimensional datasets and the release of an open-source evaluation toolkit, comprehensive assessments of five state-of-the-art models reveal significant deficiencies across these dimensions, thereby offering clear guidance and effective tools for advancing spoken dialogue system research.

📝 Abstract
With the rapid integration of advanced reasoning capabilities into spoken dialogue models, the field urgently demands benchmarks that transcend simple interactions to address real-world complexity. However, current evaluations predominantly adhere to text-generation standards, overlooking the unique audio-centric characteristics of paralinguistics and colloquialisms, alongside the cognitive depth required by modern agents. To bridge this gap, we introduce WavBench, a comprehensive benchmark designed to evaluate realistic conversational abilities where prior works fall short. Uniquely, WavBench establishes a tripartite framework: 1) the Pro subset, designed to rigorously challenge reasoning-enhanced models with significantly increased difficulty; 2) the Basic subset, defining a novel standard for spoken colloquialism that prioritizes "listenability" through natural vocabulary, linguistic fluency, and interactive rapport, rather than rigid written accuracy; and 3) the Acoustic subset, covering explicit understanding, generation, and implicit dialogue to rigorously evaluate comprehensive paralinguistic capabilities within authentic real-world scenarios. Through evaluating five state-of-the-art models, WavBench offers critical insights into the intersection of complex problem-solving, colloquial delivery, and paralinguistic fidelity, guiding the evolution of robust spoken dialogue models. The benchmark dataset and evaluation toolkit are available at https://naruto-2024.github.io/wavbench.github.io/.
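The tripartite framework described above implies scoring each model separately along the Pro, Basic, and Acoustic axes and then summarizing across them. The sketch below is a minimal, hypothetical illustration of that kind of aggregation; the function name, score values, and the 0-1 scale are assumptions for illustration, not the released WavBench toolkit's API.

```python
# Hypothetical sketch (NOT the official WavBench toolkit API): aggregate
# per-sample judge scores along the benchmark's three axes.
from statistics import mean

# Illustrative scores on an assumed 0-1 scale; subset names follow the paper
# (Pro = reasoning, Basic = colloquialism, Acoustic = paralinguistics).
results = {
    "Pro":      [0.42, 0.55, 0.38],   # complex-reasoning items
    "Basic":    [0.71, 0.66, 0.80],   # colloquial "listenability" items
    "Acoustic": [0.31, 0.45, 0.29],   # paralinguistic items
}

def summarize(results: dict[str, list[float]]) -> dict[str, float]:
    """Per-subset means plus an unweighted macro average across subsets."""
    per_subset = {name: mean(scores) for name, scores in results.items()}
    per_subset["Overall"] = mean(per_subset.values())
    return per_subset

for name, score in summarize(results).items():
    print(f"{name}: {score:.3f}")
```

An unweighted macro average is one reasonable design choice here: it keeps a model from hiding weak paralinguistic performance behind strong reasoning scores, which matches the paper's emphasis on deficiencies across all three dimensions.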
Problem

Research questions and friction points this paper is trying to address: spoken dialogue models, benchmarking, paralinguistics, colloquialism, reasoning.
Innovation

Methods, ideas, or system contributions that make the work stand out: spoken dialogue models, paralinguistics, colloquialism, reasoning benchmark, audio-centric evaluation.
Authors

Yangzhuo Li (Xiamen University)
Shengpeng Ji (Zhejiang University)
Yifu Chen (Zhejiang University)
Tianle Liang (Zhejiang University)
Haorong Ying (Xiamen University)
Yule Wang (Georgia Institute of Technology; Generative Modeling, Computational Neuroscience, Data Mining)
Junbo Li (University of Texas at Austin; agentic reasoning LLM, reinforcement learning)
Jun Fang (Zhejiang University)
Zhou Zhao (Zhejiang University; Machine Learning, Data Mining, Multimedia Computing)