🤖 AI Summary
This work addresses the limitations of existing multimodal emotion recognition approaches in fine-grained perception and reasoning over emotional cues, which stem from the insufficient sensitivity of general-purpose encoders to affective signals, the scarcity of high-quality annotated data, and the absence of cue-level evaluation benchmarks. To overcome these challenges, the authors propose XEmoGPT, a novel framework that incorporates dedicated Video and Audio Emotional Cue Bridge modules (VECB and AECB) to strengthen the fine-grained perceptual capabilities of the modality-specific encoders. They also introduce EmoCue, a large-scale dataset with fine-grained emotional cue annotations, together with EmoCue-360, an automatic evaluation metric driven by semantic similarity, and EmoCue-Eval, an expert-annotated benchmark. Together, these contributions enable, for the first time, interpretable multimodal perception and reasoning at the emotional cue level. Experimental results demonstrate that the proposed approach significantly outperforms state-of-the-art methods in both emotional cue perception and reasoning, confirming its effectiveness and interpretability.
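The summary does not describe the internals of the VECB/AECB modules, so the following is only a minimal sketch, under assumed design choices, of how such a "cue bridge" could be realized: a small trainable adapter that maps frozen video or audio encoder features to a fixed set of cue tokens in the language model's embedding space. The class name `EmotionCueBridge`, the feature/LLM dimensions, and the cross-attention design are all hypothetical, not the paper's actual architecture.

```python
# Illustrative sketch only: VECB/AECB internals are not published here, so this
# shows one common adapter pattern -- learnable query tokens cross-attending to
# frozen modality features, projected into the LLM embedding space.
import torch
import torch.nn as nn


class EmotionCueBridge(nn.Module):
    """Hypothetical bridge: frozen modality features -> LLM embedding space."""

    def __init__(self, feat_dim: int = 768, llm_dim: int = 4096, n_cue_tokens: int = 32):
        super().__init__()
        # Learnable query tokens intended to specialize on fine-grained emotional cues.
        self.cue_queries = nn.Parameter(torch.randn(n_cue_tokens, feat_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(feat_dim, llm_dim)  # project cue tokens into the LLM space

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, seq_len, feat_dim) from a frozen video or audio encoder
        queries = self.cue_queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        cues, _ = self.cross_attn(queries, feats, feats)  # (batch, n_cue_tokens, feat_dim)
        return self.proj(cues)                            # (batch, n_cue_tokens, llm_dim)


# Example: 8 video-frame features of size 768, bridged to 32 cue tokens.
bridge = EmotionCueBridge()
video_feats = torch.randn(2, 8, 768)
print(bridge(video_feats).shape)  # torch.Size([2, 32, 4096])
```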
📝 Abstract
Explainable Multimodal Emotion Recognition (EMER) plays a crucial role in applications such as human-computer interaction and social media analytics. However, current approaches struggle with cue-level perception and reasoning due to two main challenges: 1) general-purpose modality encoders are pretrained to capture global structures and general semantics rather than fine-grained emotional cues, resulting in limited sensitivity to emotional signals; and 2) available datasets usually involve a trade-off between annotation quality and scale, which leads to insufficient supervision for emotional cues and ultimately limits cue-level reasoning. Moreover, existing evaluation metrics are inadequate for assessing cue-level reasoning performance. To address these challenges, we propose eXplainable Emotion GPT (XEmoGPT), a novel EMER framework capable of both perceiving and reasoning over emotional cues. It incorporates two specialized modules, the Video Emotional Cue Bridge (VECB) and the Audio Emotional Cue Bridge (AECB), which enhance the video and audio encoders through carefully designed tasks for fine-grained emotional cue perception. To further support cue-level reasoning, we construct a large-scale dataset, EmoCue, designed to teach XEmoGPT how to reason over multimodal emotional cues. In addition, we introduce EmoCue-360, an automated metric that extracts and matches emotional cues using semantic similarity, and release EmoCue-Eval, a benchmark of 400 expert-annotated samples covering diverse emotional scenarios. Experimental results show that XEmoGPT achieves strong performance in both emotional cue perception and reasoning.
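As a rough illustration of the kind of semantic-similarity cue matching EmoCue-360 describes, the sketch below greedily pairs predicted cue phrases with reference annotations using sentence-embedding cosine similarity and reports an F1-style score. The embedding model (`all-MiniLM-L6-v2`), the 0.6 threshold, the greedy one-to-one matching, and the F1 aggregation are assumptions for illustration, not the paper's actual formulation.

```python
# Minimal sketch of semantic-similarity cue matching in the spirit of EmoCue-360.
# Model choice, threshold, matching strategy, and scoring are all assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed similarity backbone


def cue_match_f1(pred_cues, ref_cues, threshold=0.6):
    """Greedily match predicted emotional cues to reference cues and return F1."""
    if not pred_cues or not ref_cues:
        return 0.0
    sim = util.cos_sim(model.encode(pred_cues, convert_to_tensor=True),
                       model.encode(ref_cues, convert_to_tensor=True))
    matched_refs, hits = set(), 0
    for i in range(len(pred_cues)):
        # Pick the best still-unmatched reference cue above the threshold.
        best_j, best_s = -1, threshold
        for j in range(len(ref_cues)):
            if j not in matched_refs and sim[i, j] >= best_s:
                best_j, best_s = j, sim[i, j].item()
        if best_j >= 0:
            matched_refs.add(best_j)
            hits += 1
    if hits == 0:
        return 0.0
    precision = hits / len(pred_cues)
    recall = hits / len(ref_cues)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: cues a model predicted vs. expert-annotated references.
pred = ["furrowed brows", "trembling voice", "smiling"]
ref = ["voice trembles with fear", "eyebrows drawn together"]
print(round(cue_match_f1(pred, ref), 3))
```

In practice, an automatic metric of this kind would first extract candidate cue phrases from the model's free-form explanation (e.g., with an LLM prompt) before the matching step shown above; that extraction stage is omitted here.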