MLDocRAG: Multimodal Long-Context Document Retrieval Augmented Generation

📅 2026-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses information fragmentation in multimodal long documents, which arises from cross-modal heterogeneity and the need for cross-page reasoning. To this end, the authors propose MLDocRAG, a query-centric framework built on a Multimodal Chunk-Query Graph (MCQG) that projects heterogeneous content—text, images, and tables—into a shared query semantic space, yielding a structured graph representation that supports precise retrieval and interpretable evidence aggregation. By integrating multimodal document expansion, query generation, graph modeling, and retrieval-augmented generation (RAG), the framework enables joint cross-modal reasoning. Experiments on the MMLongBench-Doc and LongDocURL benchmarks show significant gains in both retrieval quality and question-answering accuracy, supporting the effectiveness of the approach for multimodal long-context understanding.

📝 Abstract
Understanding multimodal long-context documents that comprise heterogeneous chunks such as paragraphs, figures, and tables is challenging due to (1) cross-modal heterogeneity, which makes it difficult to localize relevant information across modalities, and (2) cross-page reasoning, which requires aggregating evidence dispersed across pages. To address these challenges, we adopt a query-centric formulation that projects cross-modal and cross-page information into a unified query representation space, with queries acting as abstract semantic surrogates for heterogeneous multimodal content. In this paper, we propose a Multimodal Long-Context Document Retrieval Augmented Generation (MLDocRAG) framework that leverages a Multimodal Chunk-Query Graph (MCQG) to organize multimodal document content around semantically rich, answerable queries. MCQG is constructed via a multimodal document expansion process that generates fine-grained queries from heterogeneous document chunks and links them to their corresponding content across modalities and pages. This graph-based structure enables selective, query-centric retrieval and structured evidence aggregation, thereby enhancing grounding and coherence in multimodal long-context question answering. Experiments on the MMLongBench-Doc and LongDocURL datasets show that MLDocRAG consistently improves retrieval quality and answer accuracy, demonstrating its effectiveness for multimodal long-context understanding.
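The chunk-query graph idea described above can be sketched in miniature: chunks of different modalities and pages are linked to the fine-grained queries they can answer, and retrieval runs in query space before aggregating the linked evidence. This is a toy illustration, not the paper's implementation — the class names, the caller-supplied generated queries, and the bag-of-words Jaccard similarity (standing in for a dense retriever and an LLM-based query generator) are all assumptions.

```python
# Hypothetical sketch of a Multimodal Chunk-Query Graph (MCQG):
# a bipartite graph linking generated queries to document chunks,
# with query-centric retrieval that aggregates evidence across
# modalities and pages. Names and similarity are illustrative only.
from dataclasses import dataclass
from collections import defaultdict


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    modality: str  # "text", "figure", or "table"
    page: int
    content: str   # paragraph text, caption, or linearized table cells


def similarity(a: str, b: str) -> float:
    """Toy bag-of-words Jaccard similarity standing in for a dense retriever."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


class MCQG:
    """Bipartite graph: generated queries on one side, chunks on the other."""

    def __init__(self) -> None:
        self.query_to_chunks: dict[str, list[Chunk]] = defaultdict(list)

    def add(self, generated_query: str, chunk: Chunk) -> None:
        # "Multimodal document expansion": each chunk is indexed under the
        # fine-grained queries it can answer (supplied by the caller here;
        # generated by a multimodal LLM in the paper's pipeline).
        self.query_to_chunks[generated_query].append(chunk)

    def retrieve(self, user_query: str, top_k: int = 2) -> list[Chunk]:
        # Query-centric retrieval: match the user query against generated
        # queries, then aggregate the chunks linked to the best matches,
        # which may span modalities and pages.
        ranked = sorted(self.query_to_chunks,
                        key=lambda q: similarity(user_query, q),
                        reverse=True)
        evidence, seen = [], set()
        for q in ranked[:top_k]:
            for c in self.query_to_chunks[q]:
                if c.chunk_id not in seen:
                    seen.add(c.chunk_id)
                    evidence.append(c)
        return evidence
```

A single user query can thus pull back a table from one page and a paragraph from another, because both were linked to the same generated query; the aggregated chunks would then be passed to the generator in the RAG stage.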
Problem

Research questions and friction points this paper is trying to address.

multimodal
long-context
document understanding
cross-modal
cross-page reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Retrieval
Long-Context Understanding
Query-Centric Representation
Chunk-Query Graph
Retrieval-Augmented Generation