🤖 AI Summary
To address the information loss caused by parsing visually rich documents (e.g., charts and tables in PDF/PPTX files) into text for question answering, this paper proposes VDocRAG, the first end-to-end retrieval-augmented generation (RAG) framework tailored to multi-format visual documents. Methodologically, documents are uniformly rendered as images; visual information is compressed into dense token representations via self-supervised pre-training, enabling fine-grained vision–language alignment; and the framework integrates a large vision-language model, a cross-modal encoder, a dense retriever, and a generative decoder. Key contributions include: (1) a vision-aware RAG paradigm that bypasses text-parsing distortions; (2) OpenDocVQA, the first open-domain visual document QA benchmark; and (3) significant improvements over conventional text-based RAG baselines across multiple metrics, demonstrating strong generalization and practical deployability.
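The retrieval step of the pipeline above can be sketched in miniature: each rendered page image is assumed to already be compressed into a dense embedding, and a question embedding is matched against the pages by cosine similarity. The embedding values, page descriptions, and the `retrieve_top_k` helper below are all hypothetical stand-ins, not the paper's actual encoder or index.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy dense embeddings standing in for compressed page-image representations.
# In VDocRAG these would come from the pretrained vision-language encoder;
# here they are fixed vectors so the retrieval step is runnable.
page_embeddings = [
    [0.9, 0.1, 0.0],   # page 0: e.g., a chart-heavy slide
    [0.1, 0.8, 0.3],   # page 1: e.g., a table inside a PDF
    [0.2, 0.2, 0.9],   # page 2: e.g., mostly prose
]
query_embedding = [0.15, 0.75, 0.35]  # embedding of the user's question

def retrieve_top_k(query, pages, k=1):
    """Return indices of the k page images most similar to the query."""
    scored = sorted(
        ((cosine(query, p), i) for i, p in enumerate(pages)),
        reverse=True,
    )
    return [i for _, i in scored[:k]]

top_pages = retrieve_top_k(query_embedding, page_embeddings, k=1)
print(top_pages)  # → [1]
```

The retrieved page images (not parsed text) would then be passed to the generative decoder, which is the step that lets the framework avoid text-parsing distortions.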
📝 Abstract
We aim to develop a retrieval-augmented generation (RAG) framework that answers questions over a corpus of visually-rich documents presented in mixed modalities (e.g., charts, tables) and diverse formats (e.g., PDF, PPTX). In this paper, we introduce a new RAG framework, VDocRAG, which can directly understand varied documents and modalities in a unified image format to prevent the information loss that occurs when parsing documents to obtain text. To improve performance, we propose novel self-supervised pre-training tasks that adapt large vision-language models for retrieval by compressing visual information into dense token representations while aligning them with textual content in documents. Furthermore, we introduce OpenDocVQA, the first unified collection of open-domain document visual question answering datasets, encompassing diverse document types and formats. OpenDocVQA provides a comprehensive resource for training and evaluating retrieval and question answering models on visually-rich documents in an open-domain setting. Experiments show that VDocRAG substantially outperforms conventional text-based RAG and has strong generalization capability, highlighting the potential of an effective RAG paradigm for real-world documents.