🤖 AI Summary
This study investigates the capabilities of large vision-language models (VLMs) on core art-historical discrimination tasks—zero-shot classification of painting style, artist, and period—where semantic complexity and contextual dependency pose significant challenges. To address the lack of domain-specific evaluation resources, we introduce ArTest, the first authoritative, art-history-oriented benchmark encompassing canonical movements and representative artworks. We systematically evaluate four state-of-the-art VLMs—CLIP, LLaVA, OpenFlamingo, and GPT-4o—on two publicly available datasets, conducting cross-model zero-shot transfer analysis. Results reveal a substantial performance gap: current VLMs underperform markedly on art-style classification compared to natural-image benchmarks (−23.6% average accuracy), indicating severe limitations in modeling art-historical semantics—including stylistic evolution, authorial technique, and historical context. This work establishes the first systematic, art-history-guided VLM evaluation framework and empirical benchmark, providing foundational tools for developing domain-specialized vision-language models.
📝 Abstract
The emergence of large Vision-Language Models (VLMs) has recently established new baselines in image classification across multiple domains. However, the performance of VLMs on artwork classification, and in particular on art style classification of paintings, a domain traditionally mastered by art historians, has not yet been explored. Artworks pose a unique challenge compared to natural images because of their inherently complex and diverse structures, characterized by variable compositions and styles. Art historians have long studied the unique aspects of artworks, with style prediction being a crucial component of their discipline. This paper investigates whether large VLMs, which integrate visual and textual data, can effectively predict the art-historical attributes of paintings. We conduct an in-depth analysis of four VLMs, namely CLIP, LLaVA, OpenFlamingo, and GPT-4o, focusing on zero-shot classification of art style, author, and time period using two public benchmarks of artworks. Additionally, we present ArTest, a well-curated test set of artworks, including pivotal paintings studied by art historians.
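As a rough illustration of what zero-shot style classification with a CLIP-style model involves, the sketch below scores one image embedding against text embeddings of candidate style prompts and picks the closest one. It is a minimal sketch under stated assumptions, not the paper's evaluation code: the prompt template, the three style labels, the temperature value, and the three-dimensional toy vectors are all hypothetical stand-ins for what a real image and text encoder would produce.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_emb, label_embs, temperature=0.01):
    """Return (best_label, probabilities) for one image embedding.

    label_embs maps each candidate label to the embedding of its text prompt.
    The temperature is an assumed value, not one taken from the paper.
    """
    labels = list(label_embs)
    logits = [cosine(image_emb, label_embs[name]) / temperature for name in labels]
    # Numerically stable softmax over the scaled similarities.
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    probs = {name: e / total for name, e in zip(labels, exps)}
    best = max(probs, key=probs.get)
    return best, probs


# Hypothetical labels and prompt template, for illustration only.
STYLES = ["Impressionism", "Cubism", "Baroque"]
PROMPTS = {s: f"a painting in the style of {s}" for s in STYLES}

# Toy embeddings standing in for encoder outputs (assumption: a real setup
# would embed PROMPTS with a text encoder and the painting with an image encoder).
TOY_TEXT_EMBS = {
    "Impressionism": [0.9, 0.1, 0.0],
    "Cubism": [0.1, 0.9, 0.1],
    "Baroque": [0.0, 0.2, 0.9],
}
toy_image_emb = [0.8, 0.3, 0.1]

predicted, probs = zero_shot_classify(toy_image_emb, TOY_TEXT_EMBS)
print(predicted, PROMPTS[predicted])
```

The same scoring loop applies to author or time-period prediction by swapping the label set and prompts. Generative models such as LLaVA or GPT-4o would typically be queried with a textual question instead, which this sketch does not cover.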