🤖 AI Summary
To address the visual-semantic representation disjunction, severe semantic bottlenecks, and scarce labeled data in multi-task fine-grained analysis of artworks (i.e., style classification, artist attribution, dating estimation, and tag generation), this paper proposes ArtSAGENet—the first multimodal architecture integrating Graph Neural Networks (GNNs) into fine-grained art analysis. ArtSAGENet jointly extracts visual features via CNNs and models structured artist–artwork relationships using GNNs, while enforcing cross-modal alignment between vision and semantics through a knowledge-graph-guided mechanism. Trained end-to-end on multiple tasks, it significantly reduces data and computational requirements (10× faster training) yet consistently outperforms strong CNN baselines across all four tasks, achieving state-of-the-art performance. Moreover, the model exhibits superior generalization capability and inherent interpretability through its graph-based relational reasoning and knowledge-grounded alignment.
📝 Abstract
We propose ArtSAGENet, a novel multimodal architecture that integrates Graph Neural Networks (GNNs) and Convolutional Neural Networks (CNNs), to jointly learn visual and semantic-based artistic representations. First, we illustrate the significant advantages of multi-task learning for fine art analysis and argue that it is conceptually a much more appropriate setting in the fine art domain than the single-task alternatives. We further demonstrate that several GNN architectures can outperform strong CNN baselines in a range of fine art analysis tasks, such as style classification, artist attribution, creation period estimation, and tag prediction, while training them requires an order of magnitude less computational time and only a small amount of labeled data. Finally, through extensive experimentation we show that our proposed ArtSAGENet captures and encodes valuable relational dependencies between the artists and the artworks, surpassing the performance of traditional methods that rely solely on the analysis of visual content. Our findings underline a great potential of integrating visual content and semantics for fine art analysis and curation.