🤖 AI Summary
This work addresses the lack of a unified evaluation framework for assessing the perceptual, reasoning, and generative capabilities of vision-language models in affective image analysis, particularly regarding emotional intensity calibration and descriptive depth. To bridge this gap, we introduce AICA-Bench, the first comprehensive benchmark dedicated to affective image content analysis, encompassing three core tasks (affective understanding, reasoning, and guided generation) together with systematic evaluations of 23 state-of-the-art vision-language models. Furthermore, we propose Grounded Affective Tree (GAT), a training-free prompting framework that integrates visual grounding with hierarchical reasoning to significantly reduce emotional intensity prediction errors and enhance the semantic depth of generated captions. Extensive experiments validate GAT's effectiveness across multiple models, establishing a strong baseline and a novel paradigm for affective multimodal research.
📝 Abstract
Vision-Language Models (VLMs) have demonstrated strong perceptual capabilities, yet holistic Affective Image Content Analysis (AICA), which integrates perception, reasoning, and generation into a unified framework, remains underexplored. To address this gap, we introduce AICA-Bench, a comprehensive benchmark with three core tasks: Emotion Understanding (EU), Emotion Reasoning (ER), and Emotion-Guided Content Generation (EGCG). We evaluate 23 VLMs and identify two major limitations: weak emotional intensity calibration and shallow open-ended descriptions. To mitigate these issues, we propose Grounded Affective Tree (GAT) Prompting, a training-free framework that combines visual scaffolding with hierarchical reasoning. Experiments show that GAT reduces intensity errors and improves descriptive depth, providing a strong baseline for future research on affective multimodal understanding and generation.
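The abstract describes GAT only at a high level: training-free prompting that first grounds the model in visual evidence and then reasons hierarchically toward an emotion label, a calibrated intensity, and a richer description. The sketch below is one plausible reading of that two-stage structure; the prompt wording, the stage boundaries, and the `query_vlm` callable are all illustrative assumptions, not the authors' released implementation.

```python
from typing import Callable


def gat_prompt(image_path: str,
               query_vlm: Callable[[str, str], str]) -> str:
    """Two-pass, training-free affective analysis of a single image.

    `query_vlm(image_path, prompt)` stands in for any VLM chat API;
    its signature here is a hypothetical placeholder.
    """
    # Stage 1 (visual grounding): collect concrete, checkable evidence
    # before any emotional judgment is made.
    grounding = query_vlm(
        image_path,
        "List the salient objects, people, facial expressions, and scene "
        "attributes you can actually see. Do not infer emotions yet.",
    )

    # Stage 2 (hierarchical reasoning): walk from evidence to emotion
    # category to calibrated intensity to a grounded caption, conditioning
    # each step on the Stage 1 output.
    return query_vlm(
        image_path,
        "Grounded evidence:\n" + grounding + "\n\n"
        "Step 1: Which emotion category best fits this evidence?\n"
        "Step 2: Rate its intensity on a 1-10 scale, citing the specific "
        "evidence above that justifies the rating.\n"
        "Step 3: Write a detailed caption consistent with Steps 1-2.",
    )


if __name__ == "__main__":
    # Toy stand-in so the sketch runs without a real model endpoint.
    def fake_vlm(image_path: str, prompt: str) -> str:
        return f"[model response to: {prompt[:40]}...]"

    print(gat_prompt("example.jpg", fake_vlm))
```

Because both stages are plain prompt calls, a scaffold like this needs no fine-tuning and can wrap any of the evaluated VLMs, which is consistent with the paper's claim that GAT improves multiple models as a drop-in baseline.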