ArtRAG: Retrieval-Augmented Generation with Structured Context for Visual Art Understanding

📅 2025-05-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Understanding visual art requires reasoning from cultural, historical, and stylistic perspectives, yet current multimodal large language models (MLLMs) show limited ability in fine-grained art interpretation. To address this, we propose ArtRAG, a training-free retrieval-augmented generation (RAG) framework. It introduces an Art Context Knowledge Graph (ACKG), built automatically from domain-specific text, and a multi-granular, topology-aware subgraph retrieval mechanism that supports structured, interpretable, and culturally grounded generation. Because the method supplies explicit relational knowledge at inference time rather than updating parameters, it requires no model fine-tuning. On the SemArt and Artpedia benchmarks, ArtRAG outperforms several heavily trained baselines, and human evaluation confirms that its outputs are coherent, insightful, and culturally enriched.

📝 Abstract

Understanding visual art requires reasoning across multiple perspectives -- cultural, historical, and stylistic -- beyond mere object recognition. While recent multimodal large language models (MLLMs) perform well on general image captioning, they often fail to capture the nuanced interpretations that fine art demands. We propose ArtRAG, a novel, training-free framework that combines structured knowledge with retrieval-augmented generation (RAG) for multi-perspective artwork explanation. ArtRAG automatically constructs an Art Context Knowledge Graph (ACKG) from domain-specific textual sources, organizing entities such as artists, movements, themes, and historical events into a rich, interpretable graph. At inference time, a multi-granular structured retriever selects semantically and topologically relevant subgraphs to guide generation. This enables MLLMs to produce contextually grounded, culturally informed art descriptions. Experiments on the SemArt and Artpedia datasets show that ArtRAG outperforms several heavily trained baselines. Human evaluations further confirm that ArtRAG generates coherent, insightful, and culturally enriched interpretations.
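The retrieval step described in the abstract can be pictured with a toy example. The sketch below is not the paper's implementation: the triples, the function names (`build_graph`, `seed_nodes`, `retrieve_subgraph`), and the `max_hops` parameter are all hypothetical, and plain word overlap stands in for whatever semantic matching ArtRAG actually uses. It only shows the general idea of picking seed entities by relevance to a query and then expanding along graph edges to collect a connected subgraph.

```python
from collections import deque

# Hypothetical toy triples (head, relation, tail); illustrative only,
# not data or schema taken from the paper.
TRIPLES = [
    ("Claude Monet", "member_of", "Impressionism"),
    ("Impressionism", "emerged_in", "19th-century France"),
    ("Impressionism", "emphasizes", "light and atmosphere"),
    ("Claude Monet", "painted", "Impression, Sunrise"),
    ("Rembrandt", "member_of", "Dutch Golden Age"),
    ("Dutch Golden Age", "emphasizes", "chiaroscuro"),
]


def build_graph(triples):
    """Build an undirected adjacency map: node -> [(neighbor, triple), ...]."""
    graph = {}
    for head, rel, tail in triples:
        graph.setdefault(head, []).append((tail, (head, rel, tail)))
        graph.setdefault(tail, []).append((head, (head, rel, tail)))
    return graph


def seed_nodes(graph, query):
    """Pick nodes sharing a word with the query.

    Assumption: word overlap is a crude stand-in for embedding similarity.
    """
    query_words = set(query.lower().split())
    return [n for n in graph if query_words & set(n.lower().split())]


def retrieve_subgraph(graph, query, max_hops=1):
    """Breadth-first expansion from the seed nodes, up to max_hops edges away."""
    seeds = seed_nodes(graph, query)
    visited = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    triples = []
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor, triple in graph[node]:
            if triple not in triples:
                triples.append(triple)
            if neighbor not in visited:
                visited.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return triples


if __name__ == "__main__":
    graph = build_graph(TRIPLES)
    for triple in retrieve_subgraph(graph, "painting by Monet", max_hops=2):
        print(triple)
```

Raising `max_hops` widens the retrieved context from facts about the matched entity itself to facts about its movement and period, which is one simple way to think about retrieving at more than one granularity.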

Problem

Research questions and friction points this paper is trying to address.

Enhancing visual art understanding with multi-perspective reasoning

Addressing MLLMs' limitations in nuanced fine art interpretation

Structuring art context knowledge for culturally informed descriptions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free framework combining structured knowledge with RAG

Automatically constructs Art Context Knowledge Graph (ACKG)

Multi-granular structured retriever guides generation
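As a rough illustration of what "training-free" means in the points above, the sketch below passes retrieved knowledge to a model purely through the prompt, with no parameter updates. The function names (`triples_to_context`, `build_prompt`), the prompt wording, and the example triple are assumptions made for illustration; the paper's actual prompt format is not given here.

```python
def triples_to_context(triples):
    """Render (head, relation, tail) triples as plain-text bullet facts."""
    return "\n".join(
        f"- {head} {rel.replace('_', ' ')} {tail}" for head, rel, tail in triples
    )


def build_prompt(title, triples):
    """Assemble a text prompt for an MLLM from retrieved graph facts.

    Assumption: the image itself would be passed to the model separately.
    """
    context = triples_to_context(triples) or "- (no context retrieved)"
    return (
        f"You are describing the artwork '{title}'.\n"
        "Use the background facts below where they are relevant.\n"
        f"Background facts:\n{context}\n"
        "Explain the artwork from cultural, historical, and stylistic perspectives."
    )


if __name__ == "__main__":
    # Hypothetical retrieved triple, for illustration only.
    facts = [("Claude Monet", "member_of", "Impressionism")]
    print(build_prompt("Impression, Sunrise", facts))
```

Because the knowledge arrives as prompt text, the same retrieved facts could in principle be used with any instruction-following MLLM without retraining it.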

🔎 Similar Papers

Have Large Vision-Language Models Mastered Art History?