XEmoGPT: An Explainable Multimodal Emotion Recognition Framework with Cue-Level Perception and Reasoning

📅 2026-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing multimodal emotion recognition approaches in fine-grained perception and reasoning over emotional cues, limitations that stem from the insufficient sensitivity of general-purpose encoders to affective signals, the scarcity of high-quality annotated data, and the absence of cue-level evaluation benchmarks. To overcome these challenges, the authors propose XEmoGPT, a framework that incorporates dedicated Video and Audio Emotional Cue Bridge modules (VECB/AECB) to enhance the fine-grained perceptual capabilities of modality-specific encoders. They also introduce EmoCue, a large-scale dataset with fine-grained emotion cue annotations, alongside EmoCue-360, a semantic-similarity-driven automatic evaluation metric, and EmoCue-Eval, an expert-annotated benchmark. Together, these components enable interpretable multimodal perception and reasoning at the emotion cue level. Experimental results show that the approach significantly outperforms state-of-the-art methods in both emotion cue perception and reasoning, confirming its effectiveness and interpretability.

📝 Abstract
Explainable Multimodal Emotion Recognition (EMER) plays a crucial role in applications such as human-computer interaction and social media analytics. However, current approaches struggle with cue-level perception and reasoning due to two main challenges: 1) general-purpose modality encoders are pretrained to capture global structures and general semantics rather than fine-grained emotional cues, resulting in limited sensitivity to emotional signals; and 2) available datasets usually involve a trade-off between annotation quality and scale, which leads to insufficient supervision for emotional cues and ultimately limits cue-level reasoning. Moreover, existing evaluation metrics are inadequate for assessing cue-level reasoning performance. To address these challenges, we propose eXplainable Emotion GPT (XEmoGPT), a novel EMER framework capable of both perceiving and reasoning over emotional cues. It incorporates two specialized modules: the Video Emotional Cue Bridge (VECB) and the Audio Emotional Cue Bridge (AECB), which enhance the video and audio encoders through carefully designed tasks for fine-grained emotional cue perception. To further support cue-level reasoning, we construct a large-scale dataset, EmoCue, designed to teach XEmoGPT how to reason over multimodal emotional cues. In addition, we introduce EmoCue-360, an automated metric that extracts and matches emotional cues using semantic similarity, and release EmoCue-Eval, a benchmark of 400 expert-annotated samples covering diverse emotional scenarios. Experimental results show that XEmoGPT achieves strong performance in both emotional cue perception and reasoning.
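The abstract describes EmoCue-360 as an automated metric that extracts emotional cues and matches them by semantic similarity. The paper's exact procedure is not given on this page; the sketch below is only a rough illustration of the matching idea: predicted cues are greedily paired one-to-one with reference cues by cosine similarity, here over bag-of-words vectors as a stand-in for the sentence embeddings such a metric would normally use. The function name `cue_f1` and the similarity threshold are assumptions, not the authors' specification.

```python
from collections import Counter
import math


def bow(text: str) -> Counter:
    # Toy bag-of-words vector; a real metric would use sentence embeddings.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cue_f1(predicted: list[str], reference: list[str], threshold: float = 0.5) -> float:
    """Greedily match each predicted cue to its most similar unmatched
    reference cue; a pair counts as a match when similarity >= threshold
    (the 0.5 default is an assumed value). Returns the F1 over matches."""
    unmatched = list(reference)
    matches = 0
    for p in predicted:
        scored = [(cosine(bow(p), bow(r)), r) for r in unmatched]
        if scored:
            best_sim, best_ref = max(scored, key=lambda x: x[0])
            if best_sim >= threshold:
                matches += 1
                unmatched.remove(best_ref)
    precision = matches / len(predicted) if predicted else 0.0
    recall = matches / len(reference) if reference else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

For example, with predictions `["furrowed brows", "trembling voice"]` against references `["trembling voice", "clenched fists"]`, one cue matches, giving precision and recall of 0.5 each.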
Problem

Research questions and friction points this paper is trying to address.

Explainable Multimodal Emotion Recognition
Cue-Level Perception
Cue-Level Reasoning
Emotional Cues
Multimodal Emotion Recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Explainable Multimodal Emotion Recognition
Cue-Level Perception
Emotional Cue Bridge
EmoCue Dataset
EmoCue-360 Metric
👥 Authors
Hanwen Zhang
School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
Yao Liu
Professor of Computer Science, University of South Florida
Computer and Network Security
Peiyuan Jiang
School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
Junjie Lang
54th Research Institute, China Electronics Technology Group Corporation, Shijiazhuang, Hebei, China
Xie Jun
54th Research Institute, China Electronics Technology Group Corporation, Shijiazhuang, Hebei, China
Yihui He
School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
Yajiao Deng
School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
Siyu Du
School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
Qiao Liu
School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan, China