A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding

📅 2026-04-21

🤖 AI Summary

This work addresses the limited interpretability and lack of explicit evidential support in existing art understanding models, which typically rely on implicit reasoning. To overcome these limitations, the authors propose A-MAR, a novel framework that introduces agent-based planning into art retrieval for the first time. A-MAR employs a structured reasoning plan to guide multimodal large language models in performing step-by-step, goal-directed evidence retrieval, thereby enabling fine-grained and interpretable art comprehension. The approach integrates explicit chain-of-thought reasoning with plan-conditioned retrieval and introduces ArtCoT-QA, a new benchmark for evaluation. Experiments demonstrate that A-MAR outperforms static retrieval methods and strong baselines on SemArt and Artpedia, while significantly enhancing evidence grounding and multi-hop reasoning capabilities on ArtCoT-QA.

📝 Abstract

Understanding artworks requires multi-step reasoning over visual content and cultural, historical, and stylistic context. While recent multimodal large language models show promise in artwork explanation, they rely on implicit reasoning and internalized knowledge, limiting interpretability and explicit evidence grounding. We propose A-MAR, an Agent-based Multimodal Art Retrieval framework that explicitly conditions retrieval on structured reasoning plans. Given an artwork and a user query, A-MAR first decomposes the task into a structured reasoning plan that specifies the goals and evidence requirements for each step. Retrieval is then conditioned on this plan, enabling targeted evidence selection and supporting step-wise, grounded explanations. To evaluate agent-based multimodal reasoning within the art domain, we introduce ArtCoT-QA. This diagnostic benchmark features multi-step reasoning chains for diverse art-related queries, enabling a granular analysis that extends beyond simple final answer accuracy. Experiments on SemArt and Artpedia show that A-MAR consistently outperforms static, non-planned retrieval and strong MLLM baselines in final explanation quality, while evaluations on ArtCoT-QA further demonstrate its advantages in evidence grounding and multi-step reasoning ability. These results highlight the importance of reasoning-conditioned retrieval for knowledge-intensive multimodal understanding and position A-MAR as a step toward interpretable, goal-driven AI systems, with particular relevance to cultural industries. The code and data are available at: https://github.com/ShuaiWang97/A-MAR.
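
The plan-then-retrieve flow described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `PlanStep`, `retrieve`, and `plan_conditioned_retrieval` are hypothetical, the plan is hand-written rather than produced by a multimodal LLM, and simple token overlap stands in for whatever retriever A-MAR actually uses. It only shows the idea that each plan step issues its own evidence query instead of reusing a single static query.

```python
from dataclasses import dataclass


@dataclass
class PlanStep:
    """One step of a reasoning plan (hypothetical structure)."""
    goal: str           # what this step should establish
    evidence_need: str  # the kind of evidence the step requires


def tokenize(text):
    """Lowercase whitespace tokenization into a set of tokens."""
    return set(text.lower().split())


def overlap_score(query, doc):
    """Fraction of query tokens found in the document (toy relevance score)."""
    q, d = tokenize(query), tokenize(doc)
    return len(q & d) / len(q) if q else 0.0


def retrieve(query, corpus, k=1):
    """Return the k documents with the highest overlap score for the query."""
    ranked = sorted(corpus, key=lambda doc: overlap_score(query, doc), reverse=True)
    return ranked[:k]


def plan_conditioned_retrieval(plan, corpus, k=1):
    """Retrieve evidence separately for each plan step, keyed by the step goal."""
    return {step.goal: retrieve(step.evidence_need, corpus, k) for step in plan}


# Toy knowledge base (assumed for illustration, not taken from SemArt or Artpedia).
corpus = [
    "the painting depicts a woman holding a balance beside a table of pearls",
    "dutch golden age genre painting flourished in seventeenth century delft",
    "soft window light and muted palette are typical of the artist style",
]

# Hand-written plan; in A-MAR the plan is derived from the artwork and the query.
plan = [
    PlanStep("describe visual content", "what the painting depicts"),
    PlanStep("situate historical context", "seventeenth century dutch golden age"),
    PlanStep("explain style", "light and palette style"),
]

evidence = plan_conditioned_retrieval(plan, corpus)
for goal, docs in evidence.items():
    print(f"{goal}: {docs[0]}")
```

In this toy setting each step pulls a different passage, so the final explanation can cite one piece of evidence per reasoning step. A static retriever that issues only the user query would return the same top passages for every step.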

Problem

Research questions and friction points this paper is trying to address.

multimodal art retrieval

fine-grained artwork understanding

evidence grounding

multi-step reasoning

interpretability

Innovation

Methods, ideas, or system contributions that make the work stand out.

agent-based retrieval

structured reasoning

multimodal art understanding