🤖 AI Summary
Existing multimodal emotion reasoning models are hindered by data scarcity and insufficient cross-modal fusion, often suffering from modality dominance or hallucination—particularly when visual and auditory cues are ambiguous or conflicting. To address these limitations, this work introduces SABER, the first large-scale multimodal dataset enabling fine-grained causal emotion reasoning, annotated with a six-dimensional framework that jointly captures audiovisual cues and their underlying causal logic. Furthermore, the authors propose a perception-reasoning decoupled architecture combined with a consistency-aware direct preference optimization strategy to mitigate modality dominance. Experimental results demonstrate that the proposed approach significantly outperforms open-source baselines on EMER, EmoBench-M, and SABER-Test, achieving robustness in emotion reasoning comparable to that of closed-source models.
📝 Abstract
Multimodal emotion analysis is shifting from static classification to generative reasoning. Beyond simple label prediction, robust affective reasoning must synthesize fine-grained signals such as facial micro-expressions and prosodic shifts to decode the latent causality within complex social contexts. However, current Multimodal Large Language Models (MLLMs) face significant limitations in fine-grained perception, primarily due to data scarcity and insufficient cross-modal fusion. As a result, these models often exhibit unimodal dominance, which leads to hallucinations in complex multimodal interactions, particularly when visual and acoustic cues are subtle, ambiguous, or even contradictory (e.g., in sarcastic scenarios). To address this, we introduce SABER-LLM, a framework designed for robust multimodal reasoning. First, we construct SABER, a large-scale emotion reasoning dataset comprising 600K video clips, annotated with a novel six-dimensional schema that jointly captures audiovisual cues and causal logic. Second, we propose the structured evidence decomposition paradigm, which enforces a "perceive-then-reason" separation between evidence extraction and reasoning to alleviate unimodal dominance. The ability to perceive complex scenes is further reinforced by consistency-aware direct preference optimization, which explicitly encourages alignment among modalities under ambiguous or conflicting perceptual conditions. Experiments on EMER, EmoBench-M, and SABER-Test demonstrate that SABER-LLM significantly outperforms open-source baselines and achieves robustness competitive with closed-source models in decoding complex emotional dynamics. The dataset and model are available at https://github.com/zxzhao0/SABER-LLM.
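The consistency-aware preference optimization described above builds on the standard direct preference optimization (DPO) objective. A minimal sketch of that base loss on a single preference pair follows; the `consistency_weight` parameter is a hypothetical stand-in for the paper's modality-consistency term, not its exact formulation:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected,
             beta=0.1, consistency_weight=1.0):
    """Standard DPO loss on one preference pair.

    Arguments are log-probabilities of the chosen/rejected responses under
    the trained policy (pi_*) and a frozen reference model (ref_*).
    `consistency_weight` is a hypothetical hook for a modality-consistency
    term; the paper's actual weighting scheme is not reproduced here.
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log sigmoid(margin), written stably as log(1 + exp(-margin))
    return consistency_weight * math.log1p(math.exp(-margin))
```

When the policy ranks the chosen response above the rejected one (relative to the reference), the margin is positive and the loss drops below log 2; a misranked pair pushes it above that value, so minimizing the loss steers the policy toward the preferred, modality-consistent responses.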