🤖 AI Summary
Current speech large language models (speech-LLMs) exhibit significant deficiencies in paralinguistic understanding, such as emotion, prosody, and other nonverbal cues, which limits their social and affective intelligence. To address this gap, we introduce CP-Bench, the first systematic benchmark for context-aware paralinguistic reasoning, featuring realistic tasks that jointly model linguistic content and nonverbal signals. We construct two novel question-answering datasets requiring integrated linguistic and emotional comprehension, and use them to comprehensively evaluate leading open- and closed-source speech-LLMs, including ablation studies on the effect of the temperature parameter. Experimental results reveal pervasive weaknesses in empathic reasoning: even state-of-the-art systems exhibit critical limitations. This work provides the first quantitative characterization of the paralinguistic reasoning capabilities, and fundamental boundaries, of speech-LLMs, establishing an empirical foundation and concrete improvement pathways for building affectively intelligent dialogue systems.
📝 Abstract
Recent speech-LLMs have shown impressive performance on tasks such as transcription and translation, yet they remain limited in understanding the paralinguistic aspects of speech that are crucial for social and emotional intelligence. We propose CP-Bench, a benchmark for evaluating speech-LLMs on contextual paralinguistic reasoning: the integration of verbal content with non-verbal cues such as emotion and prosody. The benchmark includes two curated question-answering (QA) datasets requiring both linguistic and empathetic understanding. We evaluate state-of-the-art open- and closed-source speech-LLMs and perform a comprehensive analysis across different question types. We further analyze the top two models under temperature tuning to understand its effect on this task. Our benchmark reveals a key gap in existing evaluations and offers insights into building more context-aware and emotionally intelligent speech-capable LLMs.
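To make the evaluation protocol concrete, the sketch below shows what a temperature-sweep ablation of this kind might look like. It is a minimal illustration, not the CP-Bench implementation: the dataset rows, the `query_speech_llm` call, and the exact-match scoring rule are all hypothetical stand-ins for a real speech-LLM API and the paper's actual metrics.

```python
# Hypothetical sketch of a temperature-sweep ablation over a contextual
# paralinguistic QA set. All names and data here are illustrative only.
import random

# Toy QA items: (audio_path, question, gold_answer)
DATASET = [
    ("clip_001.wav", "How does the speaker feel about the delay?", "frustrated"),
    ("clip_002.wav", "Is the apology sincere given the tone?", "no"),
]

def query_speech_llm(audio: str, question: str, temperature: float) -> str:
    """Placeholder for a real speech-LLM call (e.g., an API request).

    Simulates the intuition that higher sampling temperatures make
    answers noisier; a real run would send the audio and question
    to the model under test.
    """
    gold = {a: g for a, _, g in DATASET}[audio]
    return gold if random.random() > temperature * 0.5 else "unsure"

def accuracy_at_temperature(t: float) -> float:
    """Exact-match accuracy over the toy dataset at temperature t."""
    hits = sum(
        query_speech_llm(audio, q, t).strip().lower() == gold
        for audio, q, gold in DATASET
    )
    return hits / len(DATASET)

if __name__ == "__main__":
    random.seed(0)
    for t in (0.0, 0.3, 0.7, 1.0):  # sweep a few temperature settings
        print(f"temperature={t:.1f}  accuracy={accuracy_at_temperature(t):.2f}")
```

In a real ablation one would replace the placeholder with calls to each model under test and average accuracy over many sampled responses per temperature, since single-sample scores at nonzero temperature are noisy.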