🤖 AI Summary
To address the information loss caused by parsing visually rich documents (e.g., charts and tables in PDF/PPTX files) into text for question answering, this paper proposes VDocRAG, the first end-to-end retrieval-augmented generation (RAG) framework tailored to multi-format visual documents. Methodologically, documents are uniformly rendered as images; visual information is compressed into dense token representations via self-supervised pre-training, enabling fine-grained vision–language alignment; and the framework integrates a large vision-language model, a cross-modal encoder, a dense retriever, and a generative decoder. Key contributions include: (1) a vision-aware RAG paradigm that bypasses text-parsing distortions; (2) OpenDocVQA, the first open-domain visual document QA benchmark; and (3) significant improvements over conventional text-based RAG baselines across multiple metrics, demonstrating strong generalization and practical deployability.
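The retrieval step of the pipeline above can be sketched in miniature: each rendered page image is assumed to already be compressed into a dense embedding, and a question embedding is matched against the pages by cosine similarity. The embedding values, page descriptions, and the `retrieve_top_k` helper below are all hypothetical stand-ins, not the paper's actual encoder or index.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy dense embeddings standing in for compressed page-image representations.
# In VDocRAG these would come from the pretrained vision-language encoder;
# here they are fixed vectors so the retrieval step is runnable.
page_embeddings = [
    [0.9, 0.1, 0.0],   # page 0: e.g., a chart-heavy slide
    [0.1, 0.8, 0.3],   # page 1: e.g., a table inside a PDF
    [0.2, 0.2, 0.9],   # page 2: e.g., mostly prose
]
query_embedding = [0.15, 0.75, 0.35]  # embedding of the user's question

def retrieve_top_k(query, pages, k=1):
    """Return indices of the k page images most similar to the query."""
    scored = sorted(
        ((cosine(query, p), i) for i, p in enumerate(pages)),
        reverse=True,
    )
    return [i for _, i in scored[:k]]

top_pages = retrieve_top_k(query_embedding, page_embeddings, k=1)
print(top_pages)  # → [1]
```

The retrieved page images (not parsed text) would then be passed to the generative decoder, which is the step that lets the framework avoid text-parsing distortions.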
📝 Abstract
We aim to develop a retrieval-augmented generation (RAG) framework that answers questions over a corpus of visually-rich documents presented in mixed modalities (e.g., charts, tables) and diverse formats (e.g., PDF, PPTX). In this paper, we introduce a new RAG framework, VDocRAG, which can directly understand varied documents and modalities in a unified image format to prevent the information loss that occurs when parsing documents to obtain text. To improve performance, we propose novel self-supervised pre-training tasks that adapt large vision-language models for retrieval by compressing visual information into dense token representations while aligning them with textual content in documents. Furthermore, we introduce OpenDocVQA, the first unified collection of open-domain document visual question answering datasets, encompassing diverse document types and formats. OpenDocVQA provides a comprehensive resource for training and evaluating retrieval and question answering models on visually-rich documents in an open-domain setting. Experiments show that VDocRAG substantially outperforms conventional text-based RAG and has strong generalization capability, highlighting the potential of an effective RAG paradigm for real-world documents.