AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis

📅 2026-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of a unified evaluation framework for assessing perceptual, reasoning, and generative capabilities of vision-language models in affective image analysis, particularly regarding emotional intensity calibration and descriptive depth. To bridge this gap, we introduce AICA-Bench—the first comprehensive benchmark dedicated to affective image content analysis—encompassing three core tasks: affective understanding, reasoning, and guided generation, with systematic evaluations of 23 state-of-the-art vision-language models. Furthermore, we propose Grounded Affective Tree (GAT), a training-free prompting framework that integrates visual grounding with hierarchical reasoning to significantly reduce emotional intensity prediction errors and enhance the semantic depth of generated captions. Extensive experiments validate GAT’s effectiveness across multiple models, establishing a strong baseline and a novel paradigm for affective multimodal research.
📝 Abstract
Vision-Language Models (VLMs) have demonstrated strong capabilities in perception, yet holistic Affective Image Content Analysis (AICA), which integrates perception, reasoning, and generation into a unified framework, remains underexplored. To address this gap, we introduce AICA-Bench, a comprehensive benchmark with three core tasks: Emotion Understanding (EU), Emotion Reasoning (ER), and Emotion-Guided Content Generation (EGCG). We evaluate 23 VLMs and identify two major limitations: weak intensity calibration and shallow open-ended descriptions. To address these issues, we propose Grounded Affective Tree (GAT) Prompting, a training-free framework that combines visual scaffolding with hierarchical reasoning. Experiments show that GAT reduces intensity errors and improves descriptive depth, providing a strong baseline for future research on affective multimodal understanding and generation.
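The abstract describes GAT as a training-free prompting framework that first grounds the analysis in visual evidence and then reasons hierarchically toward an emotion and a calibrated intensity. The paper's exact prompts are not given on this page, so the sketch below is purely illustrative: `query_vlm` is a hypothetical stand-in for any VLM chat API, and the two-stage prompt structure is an assumption about how grounding-then-hierarchical-reasoning pipelines are typically composed, not the authors' implementation.

```python
def query_vlm(image, prompt):
    """Stand-in for a real VLM chat call; replace with an actual client
    (e.g. an OpenAI-style chat API with image inputs)."""
    return f"[VLM response to: {prompt[:40]}...]"

def gat_prompt(image):
    # Stage 1: ground the analysis in concrete visual evidence.
    grounding = query_vlm(
        image,
        "List the salient objects, people, facial expressions, and scene "
        "attributes in this image, citing visible evidence only."
    )
    # Stage 2: hierarchical reasoning -- coarse polarity first, then the
    # discrete emotion and a calibrated intensity, each conditioned on the
    # grounded evidence from stage 1.
    polarity = query_vlm(
        image,
        f"Given this evidence:\n{grounding}\n"
        "Is the overall emotional polarity positive, negative, or neutral? "
        "Justify briefly."
    )
    intensity = query_vlm(
        image,
        f"Evidence:\n{grounding}\nPolarity judgment:\n{polarity}\n"
        "Name the dominant discrete emotion and rate its intensity on a "
        "1-10 scale, explaining which visual cues set that level."
    )
    return grounding, polarity, intensity
```

Conditioning each later prompt on the earlier stages' outputs is what makes such a pipeline "grounded": the intensity judgment must cite the same visual evidence the model itself extracted, which plausibly relates to the intensity-calibration gains the abstract reports.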
Problem

Research questions and friction points this paper is trying to address.

Affective Image Content Analysis
Vision-Language Models
Emotion Understanding
Emotion Reasoning
Content Generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

AICA-Bench
Vision-Language Models
Grounded Affective Tree
Emotion Reasoning
Emotion-Guided Generation
Dong She
University of Science and Technology of China
Computer vision
Xianrong Yao
School of Future Technology, South China University of Technology, Guangzhou
Liqun Chen
School of Future Technology, South China University of Technology, Guangzhou
Jinghe Yu
School of Future Technology, South China University of Technology, Guangzhou
Yang Gao
South China University of Technology
HCI, Pervasive Computing
Zhanpeng Jin
Xinshi Endowed Professor, South China University of Technology
Human-centered computing, ubiquitous computing, human-computer interaction, smart health