🤖 AI Summary
Existing Vision Retrieval-Augmented Generation (VRAG) systems struggle to reliably perceive and integrate multi-image evidence, resulting in weak reasoning foundations and frequent hallucinations. To address this, we propose EVisRAG, a novel evidence-guided, end-to-end VRAG framework that pioneers fine-grained reward binding and multi-image evidence aggregation for grounded reasoning. We introduce RS-GRPO, a unified training strategy that jointly optimizes four core components: vision-language understanding, external visual knowledge retrieval, multi-image evidence recording, and logical evidence aggregation. EVisRAG substantially improves the precise localization and effective use of critical visual evidence. Empirical evaluation demonstrates an average 27% accuracy improvement across multiple visual question answering benchmarks, outperforming state-of-the-art baselines. By enabling verifiable, interpretable, and evidence-grounded multi-image reasoning, EVisRAG establishes a new paradigm for trustworthy multimodal inference.
📄 Abstract
Visual retrieval-augmented generation (VRAG) augments vision-language models (VLMs) with external visual knowledge to ground reasoning and reduce hallucinations. Yet current VRAG systems often fail to reliably perceive and integrate evidence across multiple images, leading to weak grounding and erroneous conclusions. In this paper, we propose EVisRAG, an end-to-end framework that learns evidence-guided multi-image reasoning to address this issue. The model first observes the retrieved images and records per-image evidence, then derives the final answer from the aggregated evidence. To train EVisRAG effectively, we introduce Reward-Scoped Group Relative Policy Optimization (RS-GRPO), which binds fine-grained rewards to scope-specific tokens to jointly optimize the visual perception and reasoning abilities of VLMs. Experimental results on multiple visual question answering benchmarks demonstrate that EVisRAG delivers substantial end-to-end gains over the backbone VLM, with a 27% improvement on average. Further analysis shows that, powered by RS-GRPO, EVisRAG improves answer accuracy by precisely perceiving and localizing question-relevant evidence across multiple images and deriving the final answer from that evidence, much like a real detective.
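To make the "reward-scoped" idea concrete, below is a minimal, hypothetical sketch of how fine-grained rewards could be bound to scope-specific tokens in a GRPO-style setup, based only on the description above: each reward (e.g. a per-image evidence reward vs. a final-answer reward) is normalized group-relatively, and the resulting advantage is applied only to the tokens belonging to that reward's scope. All names, data shapes, and the two-scope split are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of reward-scoped, group-relative advantages.
# Assumption: each rollout carries one scalar reward per scope
# ("evidence", "answer") and a per-token scope mask.
from statistics import mean, pstdev

def group_relative_advantage(rewards, eps=1e-6):
    """GRPO-style normalization of one reward across a rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def scoped_token_advantages(rollouts):
    """For each rollout, assign every token the group-relative advantage
    of the scope it belongs to (None = token receives no gradient signal)."""
    scopes = rollouts[0]["rewards"].keys()
    adv = {s: group_relative_advantage([r["rewards"][s] for r in rollouts])
           for s in scopes}
    out = []
    for i, r in enumerate(rollouts):
        token_adv = [adv[scope][i] if scope is not None else 0.0
                     for scope in r["mask"]]
        out.append(token_adv)
    return out

# Toy group of two rollouts, four tokens each: the first two tokens
# record evidence, the last two form the answer.
group = [
    {"rewards": {"evidence": 1.0, "answer": 0.0},
     "mask": ["evidence", "evidence", "answer", "answer"]},
    {"rewards": {"evidence": 0.0, "answer": 1.0},
     "mask": ["evidence", "evidence", "answer", "answer"]},
]
advs = scoped_token_advantages(group)
# Rollout 0's evidence tokens get a positive advantage while its answer
# tokens get a negative one, and vice versa for rollout 1.
```

The point of the scoping is that a rollout with good evidence recording but a wrong answer is still reinforced on its evidence tokens, rather than having one sequence-level reward wash out both behaviors.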