HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models

📅 2026-04-13

🤖 AI Summary
Current evaluations of emotional intelligence in audio language models rely predominantly on synthetic speech, single-turn interactions, and subjective scoring, and so fail to capture empathetic capabilities in authentic multi-turn dialogues. This work introduces the first emotional intelligence benchmark grounded in real recordings of human multi-turn conversations. It uses multiple-choice questions with adversarial distractors to quantify models' abilities in emotional tracking and causal reasoning, adds an acoustic-semantic conflict task to assess empathetic consistency when the two modalities contradict each other, and retains an empathetic response generation component. Experiments show that current audio language models have significant weaknesses in multi-turn emotional understanding and implicit reasoning, along with a pronounced text-dominance bias and a decoupling between the acoustic and semantic modalities.

📝 Abstract
Evaluating the emotional intelligence (EI) of audio language models (ALMs) is critical. However, existing benchmarks mostly rely on synthesized speech, are limited to single-turn interactions, and depend heavily on open-ended scoring. This paper proposes HumDial-EIBench, a comprehensive benchmark for evaluating ALMs' EI. Using real-recorded human dialogues from the ICASSP 2026 HumDial Challenge, it reformulates emotional tracking and causal reasoning into multiple-choice questions with adversarial distractors, mitigating subjective scoring bias for cognitive tasks. It retains the generation of empathetic responses and introduces an acoustic-semantic conflict task to assess robustness against contradictory multimodal signals. Evaluations of eight ALMs reveal that most models struggle with multi-turn emotional tracking and implicit causal reasoning. Furthermore, all models exhibit decoupled textual and acoustic empathy, alongside a severe text-dominance bias during cross-modal conflicts.
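
To make the two objective measurements described above concrete, here is a minimal sketch of how multiple-choice accuracy and a text-dominance rate under acoustic-semantic conflict could be computed. This summary does not specify the benchmark's data format or exact metric definitions, so the `ConflictItem` fields and both function names are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass


@dataclass
class ConflictItem:
    """One conflict-task sample (hypothetical schema, not the benchmark's)."""
    acoustic_label: str  # emotion conveyed by the speaker's voice
    semantic_label: str  # emotion implied by the spoken words
    predicted: str       # emotion the model chose


def mcq_accuracy(predictions, answers):
    """Fraction of multiple-choice questions answered correctly."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def text_dominance_rate(items):
    """Share of genuinely conflicting samples where the model sided with the text."""
    conflicts = [it for it in items if it.acoustic_label != it.semantic_label]
    if not conflicts:
        return 0.0
    followed_text = sum(it.predicted == it.semantic_label for it in conflicts)
    return followed_text / len(conflicts)
```

Under this assumed definition, a text-dominance rate near 1.0 would mean the model almost always follows the words and ignores the tone of voice, which is the kind of bias the abstract reports.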
Problem

Research questions and friction points this paper is trying to address.

emotional intelligence
audio language models
multi-turn dialogue
benchmark evaluation
multimodal conflict
Innovation

Methods, ideas, or system contributions that make the work stand out.

Emotional Intelligence
Audio Language Models
Human-Recorded Dialogues
Multimodal Conflict
Adversarial Distractors