VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents

📅 2025-04-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the information loss caused by parsing visually-rich documents (e.g., charts and tables in PDF/PPTX files) into text for question answering, this paper proposes the first end-to-end retrieval-augmented generation (RAG) framework tailored to multi-format visual documents. Methodologically, documents are uniformly rendered as images; self-supervised pretraining compresses visual information into dense token representations while aligning them with the documents' textual content; and the framework integrates a large vision-language model, a cross-modal encoder, a dense retriever, and a generative decoder. Key contributions: (1) a vision-aware RAG paradigm that bypasses text-parsing distortions; (2) OpenDocVQA, the first unified open-domain visual document QA benchmark; and (3) substantial improvements over conventional text-based RAG baselines across multiple metrics, demonstrating strong generalization and practical deployability.
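The retrieval half of this pipeline lends itself to a compact sketch. Below is a minimal, illustrative Python mock-up: document pages are rendered as images, each page and the query are embedded into a shared dense space, and pages are ranked by cosine similarity. The `encode_image`/`encode_query` names and the 768-d embedding size are assumptions for illustration, with random-vector placeholder bodies standing in for the vision-language encoder the paper actually trains; this is not the authors' API.

```python
import numpy as np

def encode_image(page: np.ndarray) -> np.ndarray:
    """Placeholder: compress a rendered page image into one dense vector."""
    rng = np.random.default_rng(abs(hash(page.tobytes())) % (2**32))
    v = rng.standard_normal(768)
    return v / np.linalg.norm(v)

def encode_query(query: str) -> np.ndarray:
    """Placeholder: embed the question into the same vector space."""
    rng = np.random.default_rng(abs(hash(query)) % (2**32))
    v = rng.standard_normal(768)
    return v / np.linalg.norm(v)

def retrieve(query: str, pages: list, k: int = 3) -> list:
    """Dense retrieval: rank page images by cosine similarity to the query."""
    q = encode_query(query)
    scores = [float(q @ encode_image(p)) for p in pages]
    return sorted(range(len(pages)), key=lambda i: -scores[i])[:k]

# Usage: render every document page to an image (e.g., via pdf2image),
# retrieve the top-k pages, then pass them with the question to a
# vision-language generator that produces the answer.
pages = [np.full((224, 224, 3), i, dtype=np.uint8) for i in range(5)]
print(retrieve("Which quarter had the highest chart value?", pages))
```

Treating every format (PDF, PPTX, scans) as rendered images keeps one retrieval index for the whole corpus, which is what lets the framework sidestep per-format text extraction entirely.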

📝 Abstract
We aim to develop a retrieval-augmented generation (RAG) framework that answers questions over a corpus of visually-rich documents presented in mixed modalities (e.g., charts, tables) and diverse formats (e.g., PDF, PPTX). In this paper, we introduce a new RAG framework, VDocRAG, which can directly understand varied documents and modalities in a unified image format to prevent missing information that occurs by parsing documents to obtain text. To improve the performance, we propose novel self-supervised pre-training tasks that adapt large vision-language models for retrieval by compressing visual information into dense token representations while aligning them with textual content in documents. Furthermore, we introduce OpenDocVQA, the first unified collection of open-domain document visual question answering datasets, encompassing diverse document types and formats. OpenDocVQA provides a comprehensive resource for training and evaluating retrieval and question answering models on visually-rich documents in an open-domain setting. Experiments show that VDocRAG substantially outperforms conventional text-based RAG and has strong generalization capability, highlighting the potential of an effective RAG paradigm for real-world documents.
Problem

Research questions and friction points this paper is trying to address.

Develop RAG framework for visually-rich mixed-modality documents
Improve performance via self-supervised vision-language pretraining tasks (see the alignment sketch after this list)
Create unified OpenDocVQA dataset for training and evaluation
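One common way to realize the pretraining point above is a contrastive image-text alignment objective. The sketch below shows a symmetric InfoNCE loss over paired (rendered page, extracted page text) embeddings. It is a generic illustration of vision-language alignment under assumed unit-norm embeddings, not a claim about the paper's exact pretraining tasks.

```python
import numpy as np

def log_softmax(x: np.ndarray, axis: int) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def info_nce(img: np.ndarray, txt: np.ndarray, tau: float = 0.07) -> float:
    """Symmetric InfoNCE over a batch of (page-image, page-text) pairs.

    img, txt: (B, D) L2-normalized embeddings; row i of each side is a pair.
    Pulls each page image toward its own text and away from other pages.
    A generic alignment objective, not necessarily the paper's exact loss.
    """
    logits = img @ txt.T / tau                 # (B, B) pairwise similarities
    diag = np.arange(len(logits))              # matching pairs sit on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return float((loss_i2t + loss_t2i) / 2)

# Toy batch: 4 pairs of unit-norm 768-d embeddings
rng = np.random.default_rng(0)
img = rng.standard_normal((4, 768)); img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = rng.standard_normal((4, 768)); txt /= np.linalg.norm(txt, axis=1, keepdims=True)
print(info_nce(img, txt))
```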
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified image format for diverse document types
Self-supervised pre-training for vision-language models
OpenDocVQA dataset for training and evaluation
Ryota Tanaka
NTT Human Informatics Laboratories, NTT Corporation
Taichi Iki
NTT Human Informatics Laboratories, NTT Corporation
Taku Hasegawa
NTT Human Informatics Laboratories, NTT Corporation
Kyosuke Nishida
NTT Human Informatics Laboratories, NTT Corporation
natural language processing · vision and language · artificial intelligence · data mining
Kuniko Saito
NTT Human Informatics Laboratories, NTT Corporation
Jun Suzuki
Tohoku University
Natural Language Processing · Machine Learning · Artificial Intelligence