ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents

📅 2025-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address key RAG bottlenecks in visually rich documents, including weak multimodal retrieval, shallow cross-modal understanding, and insufficient complex reasoning, this paper proposes ViDoRAG. Methodologically, it introduces the first Gaussian Mixture Model (GMM)-based hybrid multimodal retrieval mechanism, integrating vision-text joint embedding with learned re-ranking; designs a dynamic iterative agent workflow ("Explore-Summarize-Reflect") to enable test-time reasoning expansion; and adopts a multi-agent collaborative architecture to strengthen cross-modal semantic alignment. On the newly constructed ViDoSeek benchmark, a challenging testbed for visual document QA with multi-step reasoning, ViDoRAG outperforms current state-of-the-art methods by over 10%, advancing both visual document question answering and compositional reasoning for multimodal RAG systems.

📝 Abstract
Understanding information from visually rich documents remains a significant challenge for traditional Retrieval-Augmented Generation (RAG) methods. Existing benchmarks predominantly focus on image-based question answering (QA), overlooking the fundamental challenges of efficient retrieval, comprehension, and reasoning within dense visual documents. To bridge this gap, we introduce ViDoSeek, a novel dataset designed to evaluate RAG performance on visually rich documents requiring complex reasoning. Based on it, we identify key limitations in current RAG approaches: (i) purely visual retrieval methods struggle to effectively integrate both textual and visual features, and (ii) previous approaches often allocate insufficient reasoning tokens, limiting their effectiveness. To address these challenges, we propose ViDoRAG, a novel multi-agent RAG framework tailored for complex reasoning across visual documents. ViDoRAG employs a Gaussian Mixture Model (GMM)-based hybrid strategy to effectively handle multi-modal retrieval. To further elicit the model's reasoning capabilities, we introduce an iterative agent workflow incorporating exploration, summarization, and reflection, providing a framework for investigating test-time scaling in RAG domains. Extensive experiments on ViDoSeek validate the effectiveness and generalization of our approach. Notably, ViDoRAG outperforms existing methods by over 10% on the competitive ViDoSeek benchmark.
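The abstract's GMM-based hybrid strategy suggests modeling the distribution of retrieval scores to separate relevant from irrelevant candidates adaptively, rather than using a fixed top-K cutoff. Below is a minimal illustrative sketch of that idea, assuming a simple two-component 1-D Gaussian mixture fit by EM over similarity scores; the paper's actual hybrid multimodal mechanism is more involved.

```python
import numpy as np

def gmm_adaptive_topk(scores, iters=50):
    """Fit a 2-component 1-D Gaussian mixture to retrieval scores via EM,
    then keep only documents assigned to the high-score component.
    Illustrative sketch only; not the paper's exact procedure."""
    s = np.asarray(scores, dtype=float)
    # Initialize one component at the lowest score, one at the highest,
    # with shared variance and equal mixing weights.
    mu = np.array([s.min(), s.max()])
    var = np.array([s.var() + 1e-6] * 2)
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibility of each component for each score.
        dens = pi / np.sqrt(2 * np.pi * var) * np.exp(
            -(s[:, None] - mu) ** 2 / (2 * var)
        )
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update mixing weights, means, and variances.
        nk = resp.sum(axis=0)
        pi = nk / len(s)
        mu = (resp * s[:, None]).sum(axis=0) / nk
        var = (resp * (s[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    hi = int(np.argmax(mu))  # component with the higher mean = "relevant"
    keep = resp[:, hi] > 0.5
    return [i for i, k in enumerate(keep) if k]
```

With a ranked list whose scores cluster into a clearly relevant group and a clearly irrelevant one, the cutoff falls between the clusters instead of at an arbitrary K, which is the appeal of a distribution-based truncation.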
Problem

Research questions and friction points this paper is trying to address.

Purely visual retrieval struggles to integrate textual and visual features
Retrieval over dense, visually rich documents is inefficient
Prior methods allocate insufficient reasoning tokens, limiting complex reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Iterative Reasoning Agents
Gaussian Mixture Model retrieval
Exploration-Summarization-Reflection workflow
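The Exploration-Summarization-Reflection workflow can be pictured as a loop that gathers evidence, drafts an answer, and checks whether the answer is complete before spending another round. The sketch below uses hypothetical toy stand-ins (`explore`, `summarize`, `reflect` here are keyword-matching stubs, not the paper's agents) to show the control flow only.

```python
def explore(query, corpus, notes):
    """Toy explorer: return passages that mention a query word
    and have not already been collected."""
    words = query.lower().split()
    hits = [p for p in corpus if any(w in p.lower() for w in words)]
    return [p for p in hits if p not in notes]

def summarize(notes):
    """Toy summarizer: join the accumulated evidence into a draft answer."""
    return " ".join(notes)

def reflect(draft, query):
    """Toy reflector: accept the draft once every query word appears in it."""
    return all(w in draft.lower() for w in query.lower().split())

def explore_summarize_reflect(query, corpus, max_rounds=5):
    """Simplified loop in the spirit of the paper's iterative workflow:
    explore for new evidence, summarize it, reflect on whether to stop."""
    notes, draft = [], ""
    for _ in range(max_rounds):
        new = explore(query, corpus, notes)
        if not new:
            break  # nothing left to explore
        notes.extend(new)
        draft = summarize(notes)
        if reflect(draft, query):
            break  # reflection judges the answer complete
    return draft
```

The round cap (`max_rounds`) is where test-time scaling enters: allowing more exploration rounds trades additional inference compute for more thorough evidence gathering.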