MoLoRAG: Bootstrapping Document Understanding via Multi-modal Logic-aware Retrieval

📅 2025-09-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multi-page document understanding faces three obstacles: converting pages to text discards multimodal information such as figures, large vision-language models (LVLMs) have limited input length, and existing RAG methods retrieve by semantic relevance alone, overlooking logical relationships between pages and the query. To address these, this paper proposes MoLoRAG, a logic-aware retrieval-augmented generation framework. It constructs an explicit page graph that models both document structure and cross-page logical dependencies, and a lightweight vision-language model performs graph-guided traversal over it to retrieve pages by combined semantic and logical relevance. The retrieved pages are then passed to an arbitrary LVLM for question answering, with two deployment modes (a training-free zero-shot variant and a fine-tuned variant), neither requiring LVLM-specific training. Experiments on four DocQA benchmarks demonstrate average improvements of 9.68% in question-answering accuracy and 7.44% in retrieval precision. The code and datasets are publicly available.

📝 Abstract
Document Understanding is a foundational AI capability with broad applications, and Document Question Answering (DocQA) is a key evaluation task. Traditional methods convert the document into text for processing by Large Language Models (LLMs), but this process strips away critical multi-modal information like figures. While Large Vision-Language Models (LVLMs) address this limitation, their constrained input size makes multi-page document comprehension infeasible. Retrieval-augmented generation (RAG) methods mitigate this by selecting relevant pages, but they rely solely on semantic relevance, ignoring logical connections between pages and the query, which is essential for reasoning. To this end, we propose MoLoRAG, a logic-aware retrieval framework for multi-modal, multi-page document understanding. By constructing a page graph that captures contextual relationships between pages, a lightweight VLM performs graph traversal to retrieve relevant pages, including those with logical connections often overlooked. This approach combines semantic and logical relevance to deliver more accurate retrieval. After retrieval, the top-$K$ pages are fed into arbitrary LVLMs for question answering. To enhance flexibility, MoLoRAG offers two variants: a training-free solution for easy deployment and a fine-tuned version to improve logical relevance checking. Experiments on four DocQA datasets demonstrate average improvements of 9.68% in accuracy over LVLM direct inference and 7.44% in retrieval precision over baselines. Codes and datasets are released at https://github.com/WxxShirley/MoLoRAG.
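The graph traversal described above can be sketched roughly as follows. This is a minimal illustration, not the paper's actual implementation: `logic_check` is a hypothetical stand-in for the lightweight VLM's logical-relevance judgment, and all function and variable names are assumptions. Seed pages ranked by semantic score are expanded along page-graph edges so that logically connected pages are reachable even when their own semantic scores are low.

```python
from collections import deque

def retrieve_pages(graph, sem_scores, logic_check, seeds, k):
    """Graph-guided retrieval sketch: start from semantically top-ranked
    seed pages and expand along page-graph edges, enqueueing neighbors
    that pass a logical-relevance check, until k pages are collected."""
    selected, visited = [], set()
    # Visit the most semantically relevant seeds first.
    queue = deque(sorted(seeds, key=lambda p: -sem_scores.get(p, 0.0)))
    while queue and len(selected) < k:
        page = queue.popleft()
        if page in visited:
            continue
        visited.add(page)
        selected.append(page)
        # Expand to linked pages; logic_check stands in for the
        # lightweight VLM judging logical relevance to the query.
        for nbr in graph.get(page, []):
            if nbr not in visited and logic_check(nbr):
                queue.append(nbr)
    return selected[:k]
```

In a full pipeline, the pages returned here would be rendered as images and fed to the downstream LVLM for answer generation.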
Problem

Research questions and friction points this paper is trying to address.

Addresses the limited input capacity of large vision-language models on multi-page documents
Retrieves pages that are logically, not merely semantically, relevant to the query
Enables accurate multi-page document question answering with higher retrieval precision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal logic-aware retrieval framework
Page graph construction for contextual relationships
Graph traversal for semantic and logical relevance
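One plausible way to construct the page graph, sketched below under stated assumptions (the paper's actual edge criteria may differ): link consecutive pages to capture document structure, and link page pairs whose embedding cosine similarity exceeds a threshold to capture contextual relatedness. The function name and threshold are illustrative.

```python
import numpy as np

def build_page_graph(embeddings, threshold=0.8):
    """Build an adjacency list over pages: consecutive pages are linked
    (reading order / document structure), and page pairs whose embedding
    cosine similarity exceeds `threshold` are linked (relatedness)."""
    n = len(embeddings)
    # Normalize rows so that dot products are cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    graph = {i: set() for i in range(n)}
    for i in range(n):
        if i + 1 < n:                      # reading-order edge
            graph[i].add(i + 1)
            graph[i + 1].add(i)
        for j in range(i + 1, n):
            if sims[i, j] >= threshold:    # similarity edge
                graph[i].add(j)
                graph[j].add(i)
    return {i: sorted(nbrs) for i, nbrs in graph.items()}
```

The resulting adjacency list is the structure a retriever would traverse at query time.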