🤖 AI Summary
Current speech large language models (speech-LLMs) exhibit significant deficiencies in paralinguistic understanding, such as emotion, prosody, and other nonverbal cues, which limits their social and affective intelligence. To address this gap, we introduce CP-Bench, the first systematic benchmark for context-aware paralinguistic reasoning, featuring realistic tasks that jointly model linguistic content and nonverbal signals. We construct two novel question-answering datasets requiring integrated linguistic and emotional comprehension, and use them to comprehensively evaluate leading open- and closed-source speech-LLMs, including ablation studies on the effect of the temperature parameter. Experimental results reveal pervasive weaknesses in empathic reasoning: even state-of-the-art systems exhibit critical limitations. This work provides the first quantitative characterization of the paralinguistic reasoning capabilities, and fundamental boundaries, of speech-LLMs, establishing an empirical foundation and concrete improvement pathways for building affectively intelligent dialogue systems.
📝 Abstract
Recent speech-LLMs have shown impressive performance on tasks such as transcription and translation, yet they remain limited in understanding the paralinguistic aspects of speech that are crucial for social and emotional intelligence. We propose CP-Bench, a benchmark for evaluating speech-LLMs on contextual paralinguistic reasoning: the integration of verbal content with non-verbal cues such as emotion and prosody. The benchmark includes two curated question-answering (QA) datasets requiring both linguistic and empathetic understanding. We evaluate state-of-the-art open- and closed-source speech-LLMs and perform a comprehensive analysis across different question types. We further analyze the top two models under temperature tuning to understand its effect on this task. Our benchmark reveals a key gap in existing evaluations and offers insights into building more context-aware and emotionally intelligent speech-capable LLMs.
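To make the evaluation protocol concrete, the sketch below shows what a temperature-sweep ablation of this kind might look like. It is a minimal illustration, not the CP-Bench implementation: the dataset rows, the `query_speech_llm` call, and the exact-match scoring rule are all hypothetical stand-ins for a real speech-LLM API and the paper's actual metrics.

```python
# Hypothetical sketch of a temperature-sweep ablation over a contextual
# paralinguistic QA set. All names and data here are illustrative only.
import random

# Toy QA items: (audio_path, question, gold_answer)
DATASET = [
    ("clip_001.wav", "How does the speaker feel about the delay?", "frustrated"),
    ("clip_002.wav", "Is the apology sincere given the tone?", "no"),
]

def query_speech_llm(audio: str, question: str, temperature: float) -> str:
    """Placeholder for a real speech-LLM call (e.g., an API request).

    Simulates the intuition that higher sampling temperatures make
    answers noisier; a real run would send the audio and question
    to the model under test.
    """
    gold = {a: g for a, _, g in DATASET}[audio]
    return gold if random.random() > temperature * 0.5 else "unsure"

def accuracy_at_temperature(t: float) -> float:
    """Exact-match accuracy over the toy dataset at temperature t."""
    hits = sum(
        query_speech_llm(audio, q, t).strip().lower() == gold
        for audio, q, gold in DATASET
    )
    return hits / len(DATASET)

if __name__ == "__main__":
    random.seed(0)
    for t in (0.0, 0.3, 0.7, 1.0):  # sweep a few temperature settings
        print(f"temperature={t:.1f}  accuracy={accuracy_at_temperature(t):.2f}")
```

In a real ablation one would replace the placeholder with calls to each model under test and average accuracy over many sampled responses per temperature, since single-sample scores at nonzero temperature are noisy.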