EchoMind: An Interrelated Multi-level Benchmark for Evaluating Empathetic Speech Language Models

📅 2025-10-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current speech language models (SLMs) struggle to jointly perceive non-lexical acoustic cues (such as prosody, rhythm, and emotional intonation) and reason empathetically over them, while prevailing benchmarks evaluate these capabilities in isolation and so fail to reflect the multimodal integration essential for authentic empathetic dialogue. Method: We propose EchoMind, the first interrelated, multi-level benchmark, which pairs semantically neutral scripts with controllable voice styles to build a fine-grained empathy evaluation framework covering 39 vocal attributes, assessed across four interdependent stages: spoken-content understanding, vocal-cue perception, integrated reasoning, and response generation. Contribution/Results: Evaluating 12 state-of-the-art SLMs reveals consistent deficiencies in expressive speech recognition and context-adaptive empathetic generation, with particular weaknesses in instruction following and robustness to natural speech variability.

📝 Abstract
Speech Language Models (SLMs) have made significant progress in spoken language understanding. Yet it remains unclear whether they can fully perceive non-lexical vocal cues alongside spoken words, and respond with empathy that aligns with both emotional and contextual factors. Existing benchmarks typically evaluate linguistic, acoustic, reasoning, or dialogue abilities in isolation, overlooking the integration of these skills that is crucial for human-like, emotionally intelligent conversation. We present EchoMind, the first interrelated, multi-level benchmark that simulates the cognitive process of empathetic dialogue through sequential, context-linked tasks: spoken-content understanding, vocal-cue perception, integrated reasoning, and response generation. All tasks share identical and semantically neutral scripts that are free of explicit emotional or contextual cues, and controlled variations in vocal style are used to test the effect of delivery independent of the transcript. EchoMind is grounded in an empathy-oriented framework spanning 3 coarse and 12 fine-grained dimensions, encompassing 39 vocal attributes, and evaluated using both objective and subjective metrics. Testing 12 advanced SLMs reveals that even state-of-the-art models struggle with high-expressive vocal cues, limiting empathetic response quality. Analyses of prompt strength, speech source, and ideal vocal cue recognition reveal persistent weaknesses in instruction-following, resilience to natural speech variability, and effective use of vocal cues for empathy. These results underscore the need for SLMs that integrate linguistic content with diverse vocal cues to achieve truly empathetic conversational ability.
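To make the benchmark's interrelated design concrete, here is a minimal sketch of how one EchoMind item and its four sequential, context-linked stages could be organized. Everything here (the class and function names, the stage labels, and the caller-supplied run_stage and judge hooks) is a hypothetical illustration built from the abstract's description, not the paper's released code or data schema.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class VocalRendering:
    """One controlled voice-style rendering of a shared neutral script."""
    audio_path: str        # same transcript as every sibling rendering
    voice_style: str       # e.g. "low energy, slow tempo" (hypothetical label)
    attributes: set[str]   # subset of the 39 vocal attributes it expresses

@dataclass
class EchoItem:
    """A semantically neutral script plus its vocal-style variants."""
    transcript: str        # free of explicit emotional or contextual cues
    renderings: list[VocalRendering] = field(default_factory=list)

# The four interdependent stages named in the abstract; each stage's
# output is added to the context that the next stage sees.
STAGES = ("spoken-content understanding", "vocal-cue perception",
          "integrated reasoning", "response generation")

def evaluate_item(
    run_stage: Callable[[str, str, dict], str],          # (stage, audio, context) -> output
    judge: Callable[[str, str, VocalRendering], float],  # (stage, output, rendering) -> score
    item: EchoItem,
) -> dict[str, dict[str, float]]:
    """Score every rendering of one item across all four linked stages."""
    scores: dict[str, dict[str, float]] = {}
    for rendering in item.renderings:
        context: dict = {}             # accumulates earlier stages' outputs
        per_stage: dict[str, float] = {}
        for stage in STAGES:
            output = run_stage(stage, rendering.audio_path, context)
            context[stage] = output    # later stages see earlier results
            per_stage[stage] = judge(stage, output, rendering)
        scores[rendering.voice_style] = per_stage
    return scores
```

Chaining the stages through a shared context mirrors the benchmark's premise that errors in vocal-cue perception propagate into reasoning and, ultimately, into empathetic response quality.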
Problem

Research questions and friction points this paper is trying to address.

Evaluating SLMs' ability to perceive non-lexical vocal cues alongside spoken words
Assessing whether empathetic responses align with both emotional and contextual factors
Testing the integration of linguistic, acoustic, reasoning, and dialogue abilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-level benchmark simulating the cognitive process of empathetic dialogue
Sequential, context-linked tasks over neutral scripts with controlled vocal-style variations (see the sketch after this list)
Empathy-oriented framework evaluated with both objective and subjective metrics
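Because every rendering of an item shares one identical neutral transcript, any spread in scores across voice styles can be attributed to delivery rather than wording. Below is a minimal sketch of one way such a delivery effect could be quantified; the function name and the example numbers are hypothetical illustrations, not the paper's metric or results.

```python
def delivery_effect(scores_by_style: dict[str, float]) -> float:
    """Spread in response quality when only vocal delivery changes.

    The transcript is identical across styles, so this spread reflects
    sensitivity to vocal cues alone. (Illustrative metric, not the
    paper's actual measure.)
    """
    values = list(scores_by_style.values())
    return max(values) - min(values) if values else 0.0

# Hypothetical usage: a large spread would indicate that empathy quality
# degrades under expressive delivery of the same words.
# delivery_effect({"neutral": 0.81, "sad, slow": 0.52, "angry, fast": 0.47})
```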
👥 Authors
Li Zhou
The Chinese University of Hong Kong, Shenzhen
Lutong Yu
The Chinese University of Hong Kong, Shenzhen
You Lyu
The Chinese University of Hong Kong, Shenzhen
Yihang Lin
The Chinese University of Hong Kong, Shenzhen
Zefeng Zhao
The Chinese University of Hong Kong, Shenzhen
Junyi Ao
The Chinese University of Hong Kong, Shenzhen
Speech Recognition, Self-Supervised Learning
Yuhao Zhang
The Chinese University of Hong Kong, Shenzhen
Benyou Wang
Assistant Professor, The Chinese University of Hong Kong, Shenzhen
large language models, natural language processing, information retrieval, applied machine learning
Haizhou Li
The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), China; NUS, Singapore
Automatic Speech Recognition, Speaker Recognition, Language Recognition, Voice Conversion, Machine Translation