Artwork Interpretation with Vision Language Models: A Case Study on Emotions and Emotion Symbols

📅 2025-11-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically evaluates the capabilities of state-of-the-art vision-language models (VLMs)—specifically LLaVA-LLaMA and two variants of Qwen-VL—in recognizing emotions in art images. Addressing a gap in fine-grained affective understanding, we design four progressively demanding tasks: general content comprehension, emotion category identification, analysis of emotional expression modalities, and decoding of emotional symbols—enabling the first hierarchical assessment of both concrete and abstract emotional representations in art. Leveraging an expert-guided, multi-round question-answering qualitative analysis framework, we find that models perform robustly on figurative art but exhibit significant accuracy degradation and answer inconsistency on highly abstract or symbolic imagery. Our primary contribution is the construction of the first VLM evaluation benchmark explicitly aligned with the semantic hierarchy of artistic emotion, empirically revealing a structural limitation in current models: their inability to bridge embodied emotion perception with symbolic reasoning.

📝 Abstract
Emotions are a fundamental aspect of artistic expression. Due to their abstract nature, there is a broad spectrum of emotion realization in artworks. These are subject to historical change and their analysis requires expertise in art history. In this article, we investigate which aspects of emotional expression can be detected by current (2025) vision language models (VLMs). We present a case study of three VLMs (Llava-Llama and two Qwen models) in which we ask these models four sets of questions of increasing complexity about artworks (general content, emotional content, expression of emotions, and emotion symbols) and carry out a qualitative expert evaluation. We find that the VLMs recognize the content of the images surprisingly well and often also which emotions they depict and how they are expressed. The models perform best for concrete images but fail for highly abstract or highly symbolic images. Reliable recognition of symbols remains fundamentally difficult. Furthermore, the models continue to exhibit the well-known LLM weakness of providing inconsistent answers to related questions.
Problem

Research questions and friction points this paper is trying to address.

Investigates VLMs' ability to detect emotional expression in artworks
Evaluates VLMs' performance on content, emotions, and emotion symbols
Examines VLMs' limitations with abstract and symbolic imagery, as well as their inconsistent answers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using vision language models to analyze emotional content in artworks
Qualitative expert evaluation of model performance on abstract images
Identifying limitations in symbol recognition and answer consistency
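The multi-round, four-tier questioning protocol summarized above can be sketched in a few lines. Note that the tier names follow the abstract, but the prompt wording and the `ask` callback are hypothetical stand-ins, not the authors' actual prompts or evaluation harness.

```python
# Sketch of the paper's four-tier question protocol for probing a VLM.
# Tier names mirror the tasks in the abstract; the prompt wording and
# the `ask` backend are assumptions, not the authors' implementation.

QUESTION_TIERS = [
    ("general content", "What does this artwork depict?"),
    ("emotion category", "Which emotions does this artwork convey?"),
    ("expression modality",
     "How are these emotions expressed (e.g., color, posture, composition)?"),
    ("emotion symbols",
     "Which symbolic elements carry emotional meaning, and what do they signify?"),
]

def evaluate_artwork(ask, image):
    """Run the multi-round protocol on one image.

    `ask(image, question) -> str` wraps any VLM backend (e.g., a local
    LLaVA or Qwen-VL pipeline); answers are collected per tier for
    later qualitative expert review.
    """
    answers = {}
    for tier, question in QUESTION_TIERS:
        answers[tier] = ask(image, question)
    return answers
```

Keeping the model call behind a single `ask` callback makes it easy to run the same question ladder against several VLMs and compare their answers per tier, which is how a cross-model, expert-judged evaluation like this one is typically organized.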