🤖 AI Summary
This work addresses the limitations of existing multimodal emotion recognition approaches in fine-grained perception and reasoning over emotional cues, which stem from the insufficient sensitivity of general-purpose encoders to affective signals, the scarcity of high-quality annotated data, and the absence of cue-level evaluation benchmarks. To overcome these challenges, the authors propose XEmoGPT, a novel framework that incorporates dedicated Video and Audio Emotional Cue Bridge modules (VECB and AECB) to strengthen the fine-grained perceptual capabilities of the modality-specific encoders. They also introduce EmoCue, a large-scale dataset with fine-grained emotional cue annotations, together with EmoCue-360, an automatic evaluation metric driven by semantic similarity, and EmoCue-Eval, an expert-annotated benchmark. Together, these contributions enable, for the first time, interpretable multimodal perception and reasoning at the emotional cue level. Experimental results demonstrate that the proposed approach significantly outperforms state-of-the-art methods in both emotional cue perception and reasoning, confirming its effectiveness and interpretability.
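The summary does not describe the internals of the VECB/AECB modules, so the following is only a minimal sketch, under assumed design choices, of how such a "cue bridge" could be realized: a small trainable adapter that maps frozen video or audio encoder features to a fixed set of cue tokens in the language model's embedding space. The class name `EmotionCueBridge`, the feature/LLM dimensions, and the cross-attention design are all hypothetical, not the paper's actual architecture.

```python
# Illustrative sketch only: VECB/AECB internals are not published here, so this
# shows one common adapter pattern -- learnable query tokens cross-attending to
# frozen modality features, projected into the LLM embedding space.
import torch
import torch.nn as nn


class EmotionCueBridge(nn.Module):
    """Hypothetical bridge: frozen modality features -> LLM embedding space."""

    def __init__(self, feat_dim: int = 768, llm_dim: int = 4096, n_cue_tokens: int = 32):
        super().__init__()
        # Learnable query tokens intended to specialize on fine-grained emotional cues.
        self.cue_queries = nn.Parameter(torch.randn(n_cue_tokens, feat_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(feat_dim, llm_dim)  # project cue tokens into the LLM space

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, seq_len, feat_dim) from a frozen video or audio encoder
        queries = self.cue_queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        cues, _ = self.cross_attn(queries, feats, feats)  # (batch, n_cue_tokens, feat_dim)
        return self.proj(cues)                            # (batch, n_cue_tokens, llm_dim)


# Example: 8 video-frame features of size 768, bridged to 32 cue tokens.
bridge = EmotionCueBridge()
video_feats = torch.randn(2, 8, 768)
print(bridge(video_feats).shape)  # torch.Size([2, 32, 4096])
```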
📝 Abstract
Explainable Multimodal Emotion Recognition (EMER) plays a crucial role in applications such as human-computer interaction and social media analytics. However, current approaches struggle with cue-level perception and reasoning due to two main challenges: 1) general-purpose modality encoders are pretrained to capture global structures and general semantics rather than fine-grained emotional cues, resulting in limited sensitivity to emotional signals; and 2) available datasets usually involve a trade-off between annotation quality and scale, which leads to insufficient supervision for emotional cues and ultimately limits cue-level reasoning. Moreover, existing evaluation metrics are inadequate for assessing cue-level reasoning performance. To address these challenges, we propose eXplainable Emotion GPT (XEmoGPT), a novel EMER framework capable of both perceiving and reasoning over emotional cues. It incorporates two specialized modules, the Video Emotional Cue Bridge (VECB) and the Audio Emotional Cue Bridge (AECB), which enhance the video and audio encoders through carefully designed tasks for fine-grained emotional cue perception. To further support cue-level reasoning, we construct a large-scale dataset, EmoCue, designed to teach XEmoGPT how to reason over multimodal emotional cues. In addition, we introduce EmoCue-360, an automated metric that extracts and matches emotional cues using semantic similarity, and release EmoCue-Eval, a benchmark of 400 expert-annotated samples covering diverse emotional scenarios. Experimental results show that XEmoGPT achieves strong performance in both emotional cue perception and reasoning.
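As a rough illustration of the kind of semantic-similarity cue matching EmoCue-360 describes, the sketch below greedily pairs predicted cue phrases with reference annotations using sentence-embedding cosine similarity and reports an F1-style score. The embedding model (`all-MiniLM-L6-v2`), the 0.6 threshold, the greedy one-to-one matching, and the F1 aggregation are assumptions for illustration, not the paper's actual formulation.

```python
# Minimal sketch of semantic-similarity cue matching in the spirit of EmoCue-360.
# Model choice, threshold, matching strategy, and scoring are all assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed similarity backbone


def cue_match_f1(pred_cues, ref_cues, threshold=0.6):
    """Greedily match predicted emotional cues to reference cues and return F1."""
    if not pred_cues or not ref_cues:
        return 0.0
    sim = util.cos_sim(model.encode(pred_cues, convert_to_tensor=True),
                       model.encode(ref_cues, convert_to_tensor=True))
    matched_refs, hits = set(), 0
    for i in range(len(pred_cues)):
        # Pick the best still-unmatched reference cue above the threshold.
        best_j, best_s = -1, threshold
        for j in range(len(ref_cues)):
            if j not in matched_refs and sim[i, j] >= best_s:
                best_j, best_s = j, sim[i, j].item()
        if best_j >= 0:
            matched_refs.add(best_j)
            hits += 1
    if hits == 0:
        return 0.0
    precision = hits / len(pred_cues)
    recall = hits / len(ref_cues)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: cues a model predicted vs. expert-annotated references.
pred = ["furrowed brows", "trembling voice", "smiling"]
ref = ["voice trembles with fear", "eyebrows drawn together"]
print(round(cue_match_f1(pred, ref), 3))
```

In practice, an automatic metric of this kind would first extract candidate cue phrases from the model's free-form explanation (e.g., with an LLM prompt) before the matching step shown above; that extraction stage is omitted here.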