Visual-RAG: Benchmarking Text-to-Image Retrieval Augmented Generation for Visual Knowledge Intensive Queries

📅 2025-02-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Most existing RAG benchmarks rely exclusively on textual knowledge bases, lacking evaluation of images as primary evidence sources. To address this gap, Visual-RAG introduces the first multimodal RAG benchmark specifically designed for vision-knowledge-intensive question answering. It requires models to retrieve cue images via text-to-image retrieval and directly extract visual evidence from raw images—fundamentally treating images themselves, rather than their textual descriptions, as core RAG evidence. The benchmark defines a novel query paradigm centered on visual knowledge density and establishes a comprehensive evaluation framework covering image retrieval, multimodal reasoning, visual evidence alignment, and cross-modal fusion. Systematic evaluation across five open-source and three closed-source multimodal large language models (MLLMs) demonstrates that images can serve as effective evidence sources; however, state-of-the-art models exhibit significant limitations in visual knowledge extraction and cross-modal integration capabilities.
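The retrieval step described above can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes CLIP-style text and image encoders that map both modalities into a shared embedding space (represented here by toy vectors), and ranks candidate images by cosine similarity to the query embedding.

```python
import numpy as np

def cosine_top_k(query_emb, image_embs, k=2):
    """Rank candidate image embeddings by cosine similarity to a text query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = imgs @ q                       # cosine similarity per image
    top = np.argsort(scores)[::-1][:k]      # indices of the k best matches
    return top.tolist(), scores[top].tolist()

# Toy vectors standing in for encoder outputs (hypothetical values).
query = np.array([1.0, 0.0, 1.0])
images = np.array([
    [0.9, 0.5, 0.2],   # image 0: partially related
    [0.0, 1.0, 0.0],   # image 1: unrelated
    [1.0, 0.2, 1.1],   # image 2: closest to the query
])
idx, scores = cosine_top_k(query, images, k=2)
print(idx)  # -> [2, 0]
```

The retrieved cue images (not their captions) would then be passed directly to the MLLM as evidence, which is the behavior Visual-RAG is designed to test.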

📝 Abstract
Retrieval-Augmented Generation (RAG) is a popular approach for enhancing Large Language Models (LLMs) by addressing their limitations in verifying facts and answering knowledge-intensive questions. As LLM research extends model capabilities to input modalities other than text, e.g., images, several multimodal RAG benchmarks have been proposed. Nonetheless, they mainly use textual knowledge bases as the primary source of evidence for augmentation. Benchmarks that evaluate images as augmentation in RAG systems, and how well models leverage visual knowledge, are still lacking. We propose Visual-RAG, a novel Question Answering benchmark that emphasizes visual-knowledge-intensive questions. Unlike prior works relying on text-based evidence, Visual-RAG necessitates text-to-image retrieval and integration of relevant clue images to extract visual knowledge as evidence. With Visual-RAG, we evaluate 5 open-source and 3 proprietary Multimodal LLMs (MLLMs), revealing that images can serve as good evidence in RAG; however, even SoTA models struggle to effectively extract and utilize visual knowledge.
Problem

Research questions and friction points this paper is trying to address.

Evaluates text-to-image retrieval in RAG
Focuses on visual-knowledge-intensive queries
Assesses MLLMs' ability to use visual evidence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-to-image retrieval integration
Visual knowledge as evidence
Benchmarking multimodal LLMs
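The benchmarking contribution above can be sketched as a simple evaluation loop: for each question, retrieve cue images, ask the model, and score the answer. The `mllm_answer` stub and the toy retriever below are hypothetical stand-ins, not the paper's actual harness or scoring protocol.

```python
def mllm_answer(question, retrieved_images):
    # Stub: a real harness would send the question plus raw cue images to an MLLM.
    return retrieved_images[0]["label"] if retrieved_images else "unknown"

def evaluate(dataset, retriever):
    """Exact-match accuracy of an MLLM over a QA dataset with image retrieval."""
    correct = 0
    for item in dataset:
        images = retriever(item["question"])
        pred = mllm_answer(item["question"], images)
        correct += int(pred == item["answer"])
    return correct / len(dataset)

# Hypothetical corpus and dataset for illustration only.
corpus = [{"label": "red-tailed hawk"}, {"label": "barn owl"}]
dataset = [{"question": "Which raptor has a rust-colored tail?",
            "answer": "red-tailed hawk"}]
retriever = lambda q: corpus[:1]  # toy retriever: always returns the first image
print(evaluate(dataset, retriever))  # -> 1.0
```

A real harness would also score the retrieval stage separately (e.g., recall of the correct cue image), since the paper's findings attribute model failures to both retrieval and visual knowledge extraction.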