VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents

📅 2024-10-14
🏛️ arXiv.org
📈 Citations: 5
Influential: 1
📄 PDF
🤖 AI Summary
Existing text-based RAG systems fail to model visual structure, such as layout and figures, in multimodal documents (e.g., PDFs, scanned images), leading to information loss and suboptimal performance. To address this, the paper proposes VisRAG, a vision-language model (VLM)-based RAG pipeline: it takes raw document images as input and uses VLMs for both image-level retrieval and generation, avoiding the distortions introduced by OCR and text parsing. Key components include image-embedding-based retrieval, retriever training on open-source and synthetic multimodal data, and VLM generation conditioned on retrieved page images. Evaluated on multimodal document question answering, the approach achieves a 20–40% end-to-end gain over conventional text-based RAG baselines, while demonstrating strong data efficiency and cross-domain generalization.

📝 Abstract
Retrieval-augmented generation (RAG) is an effective technique that enables large language models (LLMs) to utilize external knowledge sources for generation. However, current RAG systems are solely based on text, rendering it impossible to utilize vision information like layout and images that play crucial roles in real-world multi-modality documents. In this paper, we introduce VisRAG, which tackles this issue by establishing a vision-language model (VLM)-based RAG pipeline. In this pipeline, instead of first parsing the document to obtain text, the document is directly embedded using a VLM as an image and then retrieved to enhance the generation of a VLM. Compared to traditional text-based RAG, VisRAG maximizes the retention and utilization of the data information in the original documents, eliminating the information loss introduced during the parsing process. We collect both open-source and synthetic data to train the retriever in VisRAG and explore a variety of generation methods. Experiments demonstrate that VisRAG outperforms traditional RAG in both the retrieval and generation stages, achieving a 20--40% end-to-end performance gain over traditional text-based RAG pipeline. Further analysis reveals that VisRAG is efficient in utilizing training data and demonstrates strong generalization capability, positioning it as a promising solution for RAG on multi-modality documents. Our code and data are available at https://github.com/openbmb/visrag.
Problem

Research questions and friction points this paper is trying to address.

Text-based RAG pipelines discard vision information (layout, figures) present in real-world multi-modality documents.
Parsing documents to text before retrieval introduces information loss that degrades downstream generation.
VisRAG addresses both by embedding documents directly as images, improving end-to-end performance by 20–40% over traditional text-based RAG.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-language model for document embedding
Direct image-based retrieval for generation
Enhanced multi-modality document processing
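The innovations above can be sketched as a minimal retrieve-then-generate loop. This is an illustrative stand-in, not the authors' implementation: `embed` fakes the shared VLM embedding space with character-frequency vectors so the sketch runs end to end, and `generate` stands in for VLM answer generation over raw page images. All function names here are hypothetical.

```python
# Hedged sketch of a VisRAG-style pipeline: pages are embedded directly
# (in the real system, as images via a VLM encoder), the query is embedded
# into the same space, and the top-k pages are handed to a VLM generator.
from math import sqrt

def embed(item: str) -> list[float]:
    # Stand-in for the VLM encoder: a normalized character-frequency
    # vector, so page "images" and text queries share one space.
    vec = [0.0] * 26
    for ch in item.lower():
        if ch.isalpha():
            vec[ord(ch) - ord('a')] += 1.0
    norm = sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def retrieve(query: str, pages: list[str], k: int = 2) -> list[str]:
    # Cosine similarity in the shared embedding space, top-k pages kept.
    qv = embed(query)
    scored = [(sum(a * b for a, b in zip(qv, embed(p))), p) for p in pages]
    scored.sort(reverse=True)
    return [page for _, page in scored[:k]]

def generate(query: str, retrieved_pages: list[str]) -> str:
    # Stand-in for VLM generation conditioned on retrieved page images.
    return f"Answer to '{query}' grounded in {len(retrieved_pages)} page image(s)"
```

The key design choice mirrored here is that retrieval operates on whole-page embeddings rather than parsed text chunks, so layout and figures are never discarded before generation.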
Authors
Shi Yu, Tsinghua University (LLM, RAG, Information Retrieval, Natural Language Processing)
Chaoyue Tang, ModelBest Inc.
Bokai Xu, ModelBest Inc.
Junbo Cui, Tsinghua University
Junhao Ran, Rice University
Yukun Yan, Tsinghua University (Large Language Model)
Zhenghao Liu, Northeastern University (NLP, Information Retrieval)
Shuo Wang, Department of Computer Science and Technology, Tsinghua University
Xu Han, Department of Computer Science and Technology, Tsinghua University
Zhiyuan Liu, Department of Computer Science and Technology, Tsinghua University
Maosong Sun, Professor of Computer Science and Technology, Tsinghua University (Natural Language Processing, Artificial Intelligence, Social Computing)