C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring Challenges in Complex Conversations

📅 2025-07-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing research lacks a systematic evaluation of Spoken Dialogue Models (SDMs) under challenges specific to speech: ambiguity, both semantic (e.g., polysemy) and phonological (e.g., homophones, prosodic stress), and context dependency (e.g., ellipsis, coreference, multi-turn interaction). This paper introduces the first bilingual (Chinese–English) benchmark for complex spoken dialogue evaluation, comprising 1,079 multi-turn instances. It defines speech-aware challenge dimensions and proposes an automated evaluation framework that combines human annotations with judgments from large language models (LLMs) to approximate human assessment of semantic, phonological, and contextual coherence in model responses. Experiments show strong agreement between the framework and human judgments (Spearman’s ρ > 0.85) and reveal capability bottlenecks of current SDMs in realistic, linguistically complex spoken dialogue scenarios.
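
The agreement figure above is a Spearman rank correlation between automated and human scores. As a rough illustration only, the sketch below shows how such a coefficient could be computed for two paired score lists. The function names and the sample scores are hypothetical and are not taken from the paper, whose actual scoring pipeline is not described here.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the whole run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(xs, ys):
    """Spearman's rho: the Pearson correlation of the (tie-averaged) ranks."""
    if len(xs) != len(ys):
        raise ValueError("score lists must have the same length")
    return pearson(average_ranks(xs), average_ranks(ys))


if __name__ == "__main__":
    # Hypothetical per-response scores, invented purely for illustration.
    human_scores = [1, 0, 1, 1, 0, 1, 0, 0]
    judge_scores = [1, 0, 1, 0, 0, 1, 0, 1]
    print(f"Spearman rho = {spearman_rho(human_scores, judge_scores):.3f}")
```

Computing the correlation on ranks rather than raw scores makes the measure insensitive to differences in scale between human and automated ratings, which is one reason it is a common choice for checking judge agreement.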

📝 Abstract

Spoken Dialogue Models (SDMs) have recently attracted significant attention for their ability to generate voice responses directly to users' spoken queries. Despite their increasing popularity, there exists a gap in research focused on comprehensively understanding their practical effectiveness in comprehending and emulating human conversations. This is especially true compared to text-based Large Language Models (LLMs), which benefit from extensive benchmarking. Human voice interactions are inherently more complex than text due to characteristics unique to spoken dialogue. Ambiguity poses one challenge, stemming from semantic factors like polysemy, as well as phonological aspects such as heterographs, heteronyms, and stress patterns. Additionally, context-dependency, like omission, coreference, and multi-turn interaction, adds further complexity to human conversational dynamics. To illuminate the current state of SDM development and to address these challenges, we present a benchmark dataset in this paper, which comprises 1,079 instances in English and Chinese. Accompanied by an LLM-based evaluation method that closely aligns with human judgment, this dataset facilitates a comprehensive exploration of the performance of SDMs in tackling these practical challenges.

Problem

Research questions and friction points this paper is trying to address.

Assessing SDMs' ability to comprehend complex human conversations

Addressing ambiguity in spoken dialogue from semantic and phonological factors

Evaluating context-dependency challenges like omission and multi-turn interactions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Bilingual benchmark dataset for SDMs

LLM-based evaluation method

Focus on spoken dialogue complexity
