🤖 AI Summary
This work addresses the lack of a unified evaluation framework for assessing the perceptual, reasoning, and generative capabilities of vision-language models in affective image analysis, particularly regarding emotional intensity calibration and descriptive depth. To bridge this gap, we introduce AICA-Bench, the first comprehensive benchmark dedicated to affective image content analysis, encompassing three core tasks (affective understanding, reasoning, and guided generation) together with systematic evaluations of 23 state-of-the-art vision-language models. Furthermore, we propose Grounded Affective Tree (GAT), a training-free prompting framework that integrates visual grounding with hierarchical reasoning to significantly reduce emotional intensity prediction errors and enhance the semantic depth of generated captions. Extensive experiments validate GAT's effectiveness across multiple models, establishing a strong baseline and a novel paradigm for affective multimodal research.
📝 Abstract
Vision-Language Models (VLMs) have demonstrated strong perceptual capabilities, yet holistic Affective Image Content Analysis (AICA), which integrates perception, reasoning, and generation into a unified framework, remains underexplored. To address this gap, we introduce AICA-Bench, a comprehensive benchmark with three core tasks: Emotion Understanding (EU), Emotion Reasoning (ER), and Emotion-Guided Content Generation (EGCG). We evaluate 23 VLMs and identify two major limitations: weak emotional intensity calibration and shallow open-ended descriptions. To mitigate these issues, we propose Grounded Affective Tree (GAT) Prompting, a training-free framework that combines visual scaffolding with hierarchical reasoning. Experiments show that GAT reduces intensity errors and improves descriptive depth, providing a strong baseline for future research on affective multimodal understanding and generation.
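The abstract describes GAT only at a high level: training-free prompting that first grounds the model in visual evidence and then reasons hierarchically toward an emotion label, a calibrated intensity, and a richer description. The sketch below is one plausible reading of that two-stage structure; the prompt wording, the stage boundaries, and the `query_vlm` callable are all illustrative assumptions, not the authors' released implementation.

```python
from typing import Callable


def gat_prompt(image_path: str,
               query_vlm: Callable[[str, str], str]) -> str:
    """Two-pass, training-free affective analysis of a single image.

    `query_vlm(image_path, prompt)` stands in for any VLM chat API;
    its signature here is a hypothetical placeholder.
    """
    # Stage 1 (visual grounding): collect concrete, checkable evidence
    # before any emotional judgment is made.
    grounding = query_vlm(
        image_path,
        "List the salient objects, people, facial expressions, and scene "
        "attributes you can actually see. Do not infer emotions yet.",
    )

    # Stage 2 (hierarchical reasoning): walk from evidence to emotion
    # category to calibrated intensity to a grounded caption, conditioning
    # each step on the Stage 1 output.
    return query_vlm(
        image_path,
        "Grounded evidence:\n" + grounding + "\n\n"
        "Step 1: Which emotion category best fits this evidence?\n"
        "Step 2: Rate its intensity on a 1-10 scale, citing the specific "
        "evidence above that justifies the rating.\n"
        "Step 3: Write a detailed caption consistent with Steps 1-2.",
    )


if __name__ == "__main__":
    # Toy stand-in so the sketch runs without a real model endpoint.
    def fake_vlm(image_path: str, prompt: str) -> str:
        return f"[model response to: {prompt[:40]}...]"

    print(gat_prompt("example.jpg", fake_vlm))
```

Because both stages are plain prompt calls, a scaffold like this needs no fine-tuning and can wrap any of the evaluated VLMs, which is consistent with the paper's claim that GAT improves multiple models as a drop-in baseline.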