🤖 AI Summary
This study addresses artwork classification, which requires capturing fine-grained visual details and abstract stylistic attributes at the same time, a combination that is difficult for traditional modeling approaches. The work systematically evaluates supervised and self-supervised backbones, using the DINO family and CLIP models, as feature extractors for painting classification and retrieval, and provides empirical evidence that self-supervised features are advantageous in this domain. It further investigates feature fusion techniques and classification strategies. The experiments show that features from self-supervised backbones consistently improve classification accuracy, and the work offers insights into how classification and retrieval modules can be applied in real-world settings such as navigation in virtual reality museums.
📝 Abstract
Classifying artworks presents a significant challenge due to the complex interplay of fine-grained details and abstract features that determine the style or genre of an artwork. This paper presents a systematic investigation of the effectiveness of supervised and self-supervised backbones as feature extractors for both artwork classification and retrieval, with a particular focus on paintings. We conduct an extensive experimental evaluation using the DINO family and CLIP models, assessing multiple classification strategies and feature representations. Our results demonstrate that employing a self-supervised backbone leads to consistent improvements in artwork classification performance. Moreover, our work provides insights into the applicability of classification and retrieval modules in real-world settings, such as virtual reality (VR) applications that support museum navigation.
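To make the "backbone as feature extractor" setup concrete, the following is a minimal sketch of how retrieval and classification can both operate on precomputed feature vectors. It assumes the embeddings have already been produced by a frozen backbone (for example a DINO or CLIP image encoder); the toy three-dimensional vectors, the style labels, and the helper names `retrieve` and `knn_classify` are hypothetical and chosen only for illustration. Cosine-similarity ranking with a k-nearest-neighbour vote is one possible strategy, not necessarily the one evaluated in the paper.

```python
import math
from collections import Counter


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def retrieve(query, gallery, k=3):
    """Return indices of the k gallery features most similar to the query."""
    scores = [(cosine(query, feat), idx) for idx, feat in enumerate(gallery)]
    # Highest similarity first; ties are broken by the lower gallery index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [idx for _, idx in scores[:k]]


def knn_classify(query, gallery, labels, k=3):
    """Predict a label by majority vote among the k nearest gallery features."""
    neighbours = retrieve(query, gallery, k)
    votes = Counter(labels[i] for i in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical toy embeddings standing in for backbone outputs.
gallery = [
    [1.0, 0.0, 0.1],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.1],
    [0.1, 0.9, 0.0],
]
labels = ["Impressionism", "Impressionism", "Cubism", "Cubism"]
query = [0.8, 0.2, 0.0]

top2 = retrieve(query, gallery, k=2)
predicted = knn_classify(query, gallery, labels, k=3)
print(top2, predicted)
```

Because the backbone stays frozen in this sketch, the same stored embeddings serve both tasks: retrieval returns the ranked neighbours directly, while classification only adds a vote over their labels.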