Disentangling Reasoning in Large Audio-Language Models for Ambiguous Emotion Prediction

📅 2026-03-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of conventional speech emotion recognition methods, which typically output a single emotion label and therefore fail to capture the inherent ambiguity of human affect. It presents the first systematic study of the reasoning capabilities of large audio-language models (LALMs) in ambiguous emotion recognition, reframing the task as emotion distribution prediction. The authors combine a perception-aligned training objective and a distribution-matching loss with structured chain-of-thought supervision that guides the model's reasoning. The approach is evaluated under three training strategies: supervised fine-tuning (SFT), direct preference optimization (DPO), and group relative policy optimization (GRPO). It yields consistent performance gains on both the IEMOCAP and CREMA-D datasets.
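To make the idea of emotion distribution prediction concrete, the sketch below shows one plausible way to build a target distribution from several annotators' labels for a single utterance. The label set `EMOTIONS` and the helper `votes_to_distribution` are hypothetical names chosen for illustration; the summary does not state which emotion classes or aggregation rule the authors use.

```python
from collections import Counter

# Hypothetical label set: the source does not specify the emotion classes.
EMOTIONS = ["angry", "happy", "neutral", "sad"]


def votes_to_distribution(votes, labels=EMOTIONS):
    """Turn per-annotator labels into a normalized distribution over `labels`."""
    counts = Counter(votes)
    unknown = set(counts) - set(labels)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one vote")
    # Labels nobody chose get probability 0.0 (Counter returns 0 for missing keys).
    return {label: counts[label] / total for label in labels}


# Four annotators disagree on one utterance; a single-label setup would keep only "sad".
dist = votes_to_distribution(["sad", "neutral", "sad", "angry"])
print(dist)
```

Under this assumed aggregation rule, annotator disagreement is preserved as probability mass on the minority labels instead of being discarded by a majority vote.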

📝 Abstract

Speech emotion recognition plays an important role in various applications. However, most existing approaches predict a single emotion label, oversimplifying the inherently ambiguous nature of human emotional expression. Recent large audio-language models (LALMs) show promise in generating richer outputs, but their reasoning ability for ambiguous emotional understanding remains limited. In this work, we reformulate ambiguous emotion recognition as a distributional reasoning problem and present the first systematic study of ambiguity-aware reasoning in LALMs. Our framework comprises two complementary components: an ambiguity-aware objective that aligns predictions with human perceptual distributions, and structured ambiguity-aware chain-of-thought supervision that guides reasoning over emotional cues. Experiments on IEMOCAP and CREMA-D demonstrate consistent improvements across SFT, DPO, and GRPO training strategies.
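As a rough illustration of what aligning predictions with a human perceptual distribution could mean in practice, the sketch below scores a predicted distribution against a human one using KL divergence. This is an assumption made for illustration only: the abstract does not say which divergence or loss the authors use, and the function name `kl_divergence` and the example numbers are hypothetical.

```python
import math


def kl_divergence(target, predicted, eps=1e-8):
    """KL(target || predicted) for two distributions given as equal-length lists.

    Assumed stand-in for an ambiguity-aware objective; not the paper's stated loss.
    """
    if len(target) != len(predicted):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for p, q in zip(target, predicted):
        if p > 0:
            # Clamp q so a zero prediction yields a large finite penalty.
            total += p * math.log(p / max(q, eps))
    return total


# Hypothetical distributions over four emotion classes.
human = [0.25, 0.0, 0.25, 0.5]        # share of annotators per class
single_label = [0.0, 0.0, 0.0, 1.0]   # a hard, one-label prediction
soft = [0.2, 0.05, 0.25, 0.5]         # a prediction that keeps the ambiguity

print(kl_divergence(human, soft))
print(kl_divergence(human, single_label))
```

With a measure of this kind, the hard one-label prediction is penalized heavily for ignoring the classes that some annotators chose, while the soft prediction scores close to zero.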

Problem

Research questions and friction points this paper is trying to address.

speech emotion recognition

ambiguous emotion

large audio-language models

emotional ambiguity

distributional reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

ambiguous emotion recognition

large audio-language models

distributional reasoning