KEVER^2: Knowledge-Enhanced Visual Emotion Reasoning and Retrieval

📅 2025-05-30

📈 Citations: 0

✨ Influential: 0

career value

191K/year

🤖 AI Summary

Image sentiment understanding faces three key challenges: abstract and ambiguous affective cues, sparse supervisory signals, and cross-domain semantic ambiguity. To address these, we propose the first semantic-structured visual affective representation framework that enables interpretable emotion reasoning and retrieval without explicit emotion annotations, achieved by integrating external affective knowledge graphs with multimodal alignment mechanisms. Methodologically, we design a knowledge-enhanced vision-language model (VLM) incorporating affective semantic embedding, cross-modal alignment, and structured knowledge injection—thereby mitigating supervision scarcity and domain shift. Evaluated on Emotion6, EmoSet, and M-Disaster benchmarks, our approach achieves an average accuracy improvement of 12.3% and up to 19% gain in fine-grained emotion recognition, significantly outperforming existing unsupervised and weakly supervised methods.

Technology Category

Application Category

📝 Abstract

Understanding what emotions images evoke in their viewers is a foundational goal in human-centric visual computing. While recent advances in vision-language models (VLMs) have shown promise for visual emotion analysis (VEA), several key challenges remain unresolved. Emotional cues in images are often abstract, overlapping, and entangled, making them difficult to model and interpret. Moreover, VLMs struggle to align these complex visual patterns with emotional semantics due to limited supervision and sparse emotional grounding. Finally, existing approaches lack structured affective knowledge to resolve ambiguity and ensure consistent emotional reasoning across diverse visual domains. To address these limitations, we propose extbf{K-EVER extsuperscript{2}}, a knowledge-enhanced framework for emotion reasoning and retrieval. Our approach introduces a semantically structured formulation of visual emotion cues and integrates external affective knowledge through multimodal alignment. Without relying on handcrafted labels or direct emotion supervision, K-EVER extsuperscript{2} achieves robust and interpretable emotion predictions across heterogeneous image types. We validate our framework on three representative benchmarks, Emotion6, EmoSet, and M-Disaster, covering social media imagery, human-centric scenes, and disaster contexts. K-EVER extsuperscript{2} consistently outperforms strong CNN and VLM baselines, achieving up to a extbf{19% accuracy gain} for specific emotions and a extbf{12.3% average accuracy gain} across all emotion categories. Our results demonstrate a scalable and generalizable solution for advancing emotional understanding of visual content.

Problem

Research questions and friction points this paper is trying to address.

Modeling abstract and entangled emotional cues in images

Aligning complex visual patterns with emotional semantics

Lacking structured affective knowledge for consistent reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Knowledge-enhanced framework for emotion reasoning

Semantically structured visual emotion cues

Multimodal alignment with affective knowledge

🔎 Similar Papers

EmoEdit: Evoking Emotions through Image Manipulation