ArtSeek: Deep artwork understanding via multimodal in-context reasoning and late interaction retrieval

📅 2025-07-29
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Digital art understanding requires joint visual analysis and deep art-historical knowledge, yet existing methods rely heavily on structured metadata. Method: We propose a metadata-agnostic art analysis framework comprising (i) WikiFragments, a large-scale dataset of unstructured Wikipedia art fragments; (ii) a late-interaction multimodal retrieval mechanism; and (iii) an agent-based contextual reasoning strategy that integrates Qwen2.5-VL, a contrastive multi-task classification network, and retrieval-augmented generation (RAG). This enables style/genre/artist identification and knowledge-driven complex visual question answering. Results: The approach achieves state-of-the-art performance across multiple benchmarks: +8.4% F1 on style classification and +7.1 BLEU@1 on ArtPedia. It accurately disentangles visual motifs and historical context even for obscure artworks. The core contribution is the first end-to-end, knowledge-enhanced, metadata-free framework for deep artistic understanding from raw images alone.
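The late-interaction retrieval mechanism mentioned above follows the ColBERT-style MaxSim idea: a query and a candidate fragment are each encoded as a matrix of token-level embeddings, and relevance is the sum, over query tokens, of each token's best match in the fragment. A minimal sketch (function names and toy embeddings are illustrative, not ArtSeek's actual implementation):

```python
import numpy as np

def late_interaction_score(query_vecs, doc_vecs):
    """ColBERT-style MaxSim: for each query token, take the maximum
    cosine similarity against any document token, then sum."""
    # Normalize token embeddings so dot products are cosine similarities.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim = q @ d.T                  # (n_query_tokens, n_doc_tokens)
    return sim.max(axis=1).sum()   # MaxSim per query token, then sum

def rank_fragments(query_vecs, fragment_vecs_list):
    """Rank candidate fragments by late-interaction score, best first."""
    scores = [late_interaction_score(query_vecs, f) for f in fragment_vecs_list]
    return sorted(range(len(scores)), key=lambda i: -scores[i])
```

Compared to single-vector retrieval, this keeps token-level granularity at scoring time, which is what lets fine-grained visual or textual cues in a query match isolated passages inside a Wikipedia fragment.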

๐Ÿ“ Abstract
Analyzing digitized artworks presents unique challenges, requiring not only visual interpretation but also a deep understanding of rich artistic, contextual, and historical knowledge. We introduce ArtSeek, a multimodal framework for art analysis that combines multimodal large language models with retrieval-augmented generation. Unlike prior work, our pipeline relies only on image input, enabling applicability to artworks without links to Wikidata or Wikipedia, a situation common for most digitized collections. ArtSeek integrates three key components: an intelligent multimodal retrieval module based on late interaction retrieval, a contrastive multitask classification network for predicting artist, genre, style, media, and tags, and an agentic reasoning strategy enabled through in-context examples for complex visual question answering and artwork explanation via Qwen2.5-VL. Central to this approach is WikiFragments, a Wikipedia-scale dataset of image-text fragments curated to support knowledge-grounded multimodal reasoning. Our framework achieves state-of-the-art results on multiple benchmarks, including a +8.4% F1 improvement in style classification over GraphCLIP and a +7.1 BLEU@1 gain in captioning on ArtPedia. Qualitative analyses show that ArtSeek can interpret visual motifs, infer historical context, and retrieve relevant knowledge, even for obscure works. Though focused on visual arts, our approach generalizes to other domains requiring external knowledge, supporting scalable multimodal AI research. Both the dataset and the source code will be made publicly available at https://github.com/cilabuniba/artseek.
Problem

Research questions and friction points this paper is trying to address.

Analyzing artworks requires visual and contextual understanding
Existing methods depend on Wikidata/Wikipedia links, limiting their use on most digitized collections
ArtSeek integrates retrieval, classification, and reasoning for art analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal framework combining MLLMs with retrieval-augmented generation
Late interaction retrieval for intelligent multimodal knowledge grounding
Agentic reasoning via in-context examples for complex VQA
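The agentic reasoning strategy above can be sketched as a minimal search-then-answer loop: the model is prompted (via in-context examples) to emit a search action when it lacks knowledge, retrieved fragments are appended to its context, and the loop ends when it commits to an answer. The `SEARCH[...]`/`ANSWER[...]` action syntax and the helper names here are assumptions for illustration, not ArtSeek's actual prompt protocol:

```python
import re

def agentic_answer(question, call_model, retrieve, max_steps=3):
    """Minimal agentic RAG loop (a sketch, not ArtSeek's exact protocol).

    call_model(context) -> model output string, which may contain a
    SEARCH[query] action or a final ANSWER[text].
    retrieve(query) -> list of retrieved fragment strings.
    """
    context = f"Question: {question}\n"
    for _ in range(max_steps):
        out = call_model(context)
        search = re.search(r"SEARCH\[(.+?)\]", out)
        if search:
            # Feed retrieved knowledge back into the model's context.
            fragments = retrieve(search.group(1))
            context += out + "\nRetrieved: " + " | ".join(fragments) + "\n"
            continue
        answer = re.search(r"ANSWER\[(.+?)\]", out)
        if answer:
            return answer.group(1)
    return None  # gave up after max_steps without a final answer
```

The same loop works with any retriever; plugging in a late-interaction retriever over WikiFragments-style image-text fragments is what grounds the answers in external knowledge rather than the model's parameters alone.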
🔎 Similar Papers
No similar papers found.