🤖 AI Summary
This work addresses the “attribute flooding” problem in multimodal large language models (MLLMs) for artwork emotion interpretation, where models enumerate many visible formal attributes without identifying the cues that actually support the affective judgment. The authors formulate the task as Attribute-Grounded Selective Reasoning (AGSR), in which predefined formal attributes serve as evidence units and only emotionally operative attributes enter the final interpretation. They propose FAB-G (Formal-Attribute Bottleneck-Guided reasoning), a supervised multi-agent framework that first predicts attribute-level salience and then constrains downstream emotion analysis to the retained cues. Using a 1,400-artwork human salience extension of the EmoArt dataset, annotated by 15 art-trained annotators, the method yields consistent gains in emotion, arousal, and valence prediction, produces more compact explanations than prompting-based baselines, and agrees more closely with human-marked salient attributes under Dice and Tversky metrics. Cross-dataset evaluation suggests that salience selection transfers beyond EmoArt, while also revealing attribute-specific boundary cases.
📝 Abstract
Multimodal large language models (MLLMs) can produce fluent artwork emotion explanations, but they often suffer from attribute flooding: they enumerate many visible formal attributes without identifying which cues actually support the affective judgment. We therefore formulate artwork emotion understanding as Attribute-Grounded Selective Reasoning (AGSR), where predefined formal attributes serve as evidence units and only emotionally operative attributes should enter the final interpretation. To make this problem measurable, we extend EmoArt, originally introduced at ACM MM 2025 as a 132,664-artwork resource with content, formal-attribute, valence-arousal, and emotion annotations, by adding a 1,400-artwork human salience extension annotated by 15 art-trained annotators. This extension provides instance-level supervision for distinguishing attributes that are merely present from those that are emotionally salient. We further propose FAB-G (Formal-Attribute Bottleneck-Guided reasoning), a supervised multi-agent framework that first predicts attribute-level salience and then constrains downstream emotional analysis to the retained cues. Experiments show that FAB-G yields consistent gains in emotion, arousal, and valence prediction, achieves stronger agreement with human-marked salient attributes under Dice and Tversky metrics, and produces substantially more compact final explanations than prompting-based baselines. Cross-dataset evaluation further suggests that attribute-grounded salience selection transfers beyond the source distribution of EmoArt, while also revealing attribute-specific boundary cases. The dataset and project page are available at https://zhiliangzhang.github.io/EmoArt-130k/
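To make the Dice and Tversky agreement measures concrete, the following is a minimal sketch of how a predicted set of salient attributes could be scored against a human-marked set. The attribute names, the function names, and the `alpha`/`beta` weights are illustrative assumptions; the abstract does not state the exact attribute vocabulary or metric settings used in the paper.

```python
def dice(pred, gold):
    """Dice coefficient between two attribute sets: 2|A∩B| / (|A| + |B|)."""
    pred, gold = set(pred), set(gold)
    if not pred and not gold:
        return 1.0  # convention assumed here: two empty sets agree perfectly
    return 2 * len(pred & gold) / (len(pred) + len(gold))


def tversky(pred, gold, alpha=0.5, beta=0.5):
    """Tversky index: TP / (TP + alpha*FP + beta*FN).

    alpha weights attributes predicted but not human-marked (false positives);
    beta weights human-marked attributes that were missed (false negatives).
    With alpha = beta = 0.5 this reduces to the Dice coefficient.
    """
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)
    fp = len(pred - gold)
    fn = len(gold - pred)
    denom = tp + alpha * fp + beta * fn
    return 1.0 if denom == 0 else tp / denom


# Hypothetical example: annotators marked two attributes as emotionally salient.
human = {"color", "brushstroke"}
# An "attribute flooding" prediction lists everything visible.
flooded = {"color", "brushstroke", "line", "composition"}
# A selective prediction keeps one salient cue and nothing else.
selective = {"color"}

# alpha > beta (an assumed setting) penalizes extra attributes more than misses.
for name, pred in [("flooded", flooded), ("selective", selective)]:
    print(name, round(dice(pred, human), 3),
          round(tversky(pred, human, alpha=0.7, beta=0.3), 3))
```

In this toy case both predictions have the same Dice score, but a Tversky index weighted toward false positives scores the flooded prediction lower, which is one reason an asymmetric measure can be useful alongside Dice when over-enumeration is the failure mode of interest.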