KEVER^2: Knowledge-Enhanced Visual Emotion Reasoning and Retrieval

📅 2025-05-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Image sentiment understanding faces three key challenges: abstract and ambiguous affective cues, sparse supervisory signals, and cross-domain semantic ambiguity. To address these, we propose the first semantic-structured visual affective representation framework that enables interpretable emotion reasoning and retrieval without explicit emotion annotations, achieved by integrating external affective knowledge graphs with multimodal alignment mechanisms. Methodologically, we design a knowledge-enhanced vision-language model (VLM) incorporating affective semantic embedding, cross-modal alignment, and structured knowledge injection—thereby mitigating supervision scarcity and domain shift. Evaluated on Emotion6, EmoSet, and M-Disaster benchmarks, our approach achieves an average accuracy improvement of 12.3% and up to 19% gain in fine-grained emotion recognition, significantly outperforming existing unsupervised and weakly supervised methods.

Technology Category

Application Category

📝 Abstract
Understanding what emotions images evoke in their viewers is a foundational goal in human-centric visual computing. While recent advances in vision-language models (VLMs) have shown promise for visual emotion analysis (VEA), several key challenges remain unresolved. Emotional cues in images are often abstract, overlapping, and entangled, making them difficult to model and interpret. Moreover, VLMs struggle to align these complex visual patterns with emotional semantics due to limited supervision and sparse emotional grounding. Finally, existing approaches lack structured affective knowledge to resolve ambiguity and ensure consistent emotional reasoning across diverse visual domains. To address these limitations, we propose extbf{K-EVER extsuperscript{2}}, a knowledge-enhanced framework for emotion reasoning and retrieval. Our approach introduces a semantically structured formulation of visual emotion cues and integrates external affective knowledge through multimodal alignment. Without relying on handcrafted labels or direct emotion supervision, K-EVER extsuperscript{2} achieves robust and interpretable emotion predictions across heterogeneous image types. We validate our framework on three representative benchmarks, Emotion6, EmoSet, and M-Disaster, covering social media imagery, human-centric scenes, and disaster contexts. K-EVER extsuperscript{2} consistently outperforms strong CNN and VLM baselines, achieving up to a extbf{19% accuracy gain} for specific emotions and a extbf{12.3% average accuracy gain} across all emotion categories. Our results demonstrate a scalable and generalizable solution for advancing emotional understanding of visual content.
Problem

Research questions and friction points this paper is trying to address.

Modeling abstract and entangled emotional cues in images
Aligning complex visual patterns with emotional semantics
Lacking structured affective knowledge for consistent reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Knowledge-enhanced framework for emotion reasoning
Semantically structured visual emotion cues
Multimodal alignment with affective knowledge
🔎 Similar Papers
No similar papers found.
Fanhang Man
Fanhang Man
Tsinghua University
OptimizationsMultimodal LLM
Xiaoyue Chen
Xiaoyue Chen
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
Huandong Wang
Huandong Wang
Department of Electronic Engineering, Tsinghua University
mobile big data miningsocial media analysissoftware-defined networks
Baining Zhao
Baining Zhao
Tsinghua University
H
Han Li
Tsinghua Shenzhen International Graduate School, Tsinghua University, Beijing, China
X
Xinlei Chen
Tsinghua Shenzhen International Graduate School, Tsinghua University, Peng Cheng Labotory, Shenzhen, China
Y
Yong Li
Department of Electronic Engineering, Tsinghua University, Beijing, China