ArtSeek: Deep artwork understanding via multimodal in-context reasoning and late interaction retrieval

📅 2025-07-29
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Digital art understanding requires joint visual analysis and deep art-historical knowledge, yet existing methods rely heavily on structured metadata. Method: We propose a metadata-agnostic art analysis framework comprising (i) WikiFragments, a large-scale dataset of unstructured Wikipedia art fragments; (ii) a late-interaction multimodal retrieval mechanism; and (iii) an agent-based contextual reasoning strategy that integrates Qwen2.5-VL, a contrastive multi-task classification network, and retrieval-augmented generation (RAG). This enables style/genre/artist identification and knowledge-driven complex visual question answering. Results: The approach achieves state-of-the-art performance across multiple benchmarks: +8.4% F1 on style classification and +7.1 BLEU@1 on ArtPedia. It accurately disentangles visual motifs and historical context even for obscure artworks. The core contribution is the first end-to-end, knowledge-enhanced, metadata-free framework for deep artistic understanding from raw images alone.
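The late-interaction retrieval mechanism mentioned above follows the ColBERT-style MaxSim idea: a query and a candidate fragment are each encoded as a matrix of token-level embeddings, and relevance is the sum, over query tokens, of each token's best match in the fragment. A minimal sketch (function names and toy embeddings are illustrative, not ArtSeek's actual implementation):

```python
import numpy as np

def late_interaction_score(query_vecs, doc_vecs):
    """ColBERT-style MaxSim: for each query token, take the maximum
    cosine similarity against any document token, then sum."""
    # Normalize token embeddings so dot products are cosine similarities.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim = q @ d.T                  # (n_query_tokens, n_doc_tokens)
    return sim.max(axis=1).sum()   # MaxSim per query token, then sum

def rank_fragments(query_vecs, fragment_vecs_list):
    """Rank candidate fragments by late-interaction score, best first."""
    scores = [late_interaction_score(query_vecs, f) for f in fragment_vecs_list]
    return sorted(range(len(scores)), key=lambda i: -scores[i])
```

Compared to single-vector retrieval, this keeps token-level granularity at scoring time, which is what lets fine-grained visual or textual cues in a query match isolated passages inside a Wikipedia fragment.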

๐Ÿ“ Abstract
Analyzing digitized artworks presents unique challenges, requiring not only visual interpretation but also a deep understanding of rich artistic, contextual, and historical knowledge. We introduce ArtSeek, a multimodal framework for art analysis that combines multimodal large language models with retrieval-augmented generation. Unlike prior work, our pipeline relies only on image input, enabling applicability to artworks without links to Wikidata or Wikipedia, a situation common for most digitized collections. ArtSeek integrates three key components: an intelligent multimodal retrieval module based on late interaction retrieval, a contrastive multitask classification network for predicting artist, genre, style, media, and tags, and an agentic reasoning strategy enabled through in-context examples for complex visual question answering and artwork explanation via Qwen2.5-VL. Central to this approach is WikiFragments, a Wikipedia-scale dataset of image-text fragments curated to support knowledge-grounded multimodal reasoning. Our framework achieves state-of-the-art results on multiple benchmarks, including a +8.4% F1 improvement in style classification over GraphCLIP and a +7.1 BLEU@1 gain in captioning on ArtPedia. Qualitative analyses show that ArtSeek can interpret visual motifs, infer historical context, and retrieve relevant knowledge, even for obscure works. Though focused on visual arts, our approach generalizes to other domains requiring external knowledge, supporting scalable multimodal AI research. Both the dataset and the source code will be made publicly available at https://github.com/cilabuniba/artseek.
Problem

Research questions and friction points this paper is trying to address.

Analyzing artworks requires visual and contextual understanding
Existing methods depend on Wikidata/Wikipedia links, limiting their use on most digitized collections
ArtSeek integrates retrieval, classification, and reasoning for art analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal framework combining MLLMs with retrieval-augmented generation
Late interaction retrieval for intelligent multimodal knowledge grounding
Agentic reasoning via in-context examples for complex VQA
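The agentic reasoning strategy above can be sketched as a minimal search-then-answer loop: the model is prompted (via in-context examples) to emit a search action when it lacks knowledge, retrieved fragments are appended to its context, and the loop ends when it commits to an answer. The `SEARCH[...]`/`ANSWER[...]` action syntax and the helper names here are assumptions for illustration, not ArtSeek's actual prompt protocol:

```python
import re

def agentic_answer(question, call_model, retrieve, max_steps=3):
    """Minimal agentic RAG loop (a sketch, not ArtSeek's exact protocol).

    call_model(context) -> model output string, which may contain a
    SEARCH[query] action or a final ANSWER[text].
    retrieve(query) -> list of retrieved fragment strings.
    """
    context = f"Question: {question}\n"
    for _ in range(max_steps):
        out = call_model(context)
        search = re.search(r"SEARCH\[(.+?)\]", out)
        if search:
            # Feed retrieved knowledge back into the model's context.
            fragments = retrieve(search.group(1))
            context += out + "\nRetrieved: " + " | ".join(fragments) + "\n"
            continue
        answer = re.search(r"ANSWER\[(.+?)\]", out)
        if answer:
            return answer.group(1)
    return None  # gave up after max_steps without a final answer
```

The same loop works with any retriever; plugging in a late-interaction retriever over WikiFragments-style image-text fragments is what grounds the answers in external knowledge rather than the model's parameters alone.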
🔎 Similar Papers
No similar papers found.