MLDocRAG: Multimodal Long-Context Document Retrieval Augmented Generation

📅 2026-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses information fragmentation in multimodal long documents, which arises from cross-modal heterogeneity and the need for cross-page reasoning. To this end, the authors propose MLDocRAG, a query-centric framework built on a Multimodal Chunk-Query Graph (MCQG) that projects heterogeneous content—text, images, and tables—into a shared query semantic space, yielding a structured graph representation that supports precise retrieval and interpretable evidence aggregation. By integrating multimodal document expansion, query generation, graph modeling, and retrieval-augmented generation (RAG), the framework enables joint cross-modal reasoning. Experiments on the MMLongBench-Doc and LongDocURL benchmarks show significant gains in both retrieval quality and question-answering accuracy, supporting the effectiveness of the approach for multimodal long-context understanding.

📝 Abstract
Understanding multimodal long-context documents that comprise heterogeneous chunks such as paragraphs, figures, and tables is challenging due to (1) cross-modal heterogeneity, which makes it difficult to localize relevant information across modalities, and (2) cross-page reasoning, which requires aggregating evidence dispersed across pages. To address these challenges, we adopt a query-centric formulation that projects cross-modal and cross-page information into a unified query representation space, with queries acting as abstract semantic surrogates for heterogeneous multimodal content. In this paper, we propose a Multimodal Long-Context Document Retrieval Augmented Generation (MLDocRAG) framework that leverages a Multimodal Chunk-Query Graph (MCQG) to organize multimodal document content around semantically rich, answerable queries. MCQG is constructed via a multimodal document expansion process that generates fine-grained queries from heterogeneous document chunks and links them to their corresponding content across modalities and pages. This graph-based structure enables selective, query-centric retrieval and structured evidence aggregation, thereby enhancing grounding and coherence in multimodal long-context question answering. Experiments on the MMLongBench-Doc and LongDocURL datasets show that MLDocRAG consistently improves retrieval quality and answer accuracy, demonstrating its effectiveness for multimodal long-context understanding.
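The chunk-query graph idea described above can be sketched in miniature: chunks of different modalities and pages are linked to the fine-grained queries they can answer, and retrieval runs in query space before aggregating the linked evidence. This is a toy illustration, not the paper's implementation — the class names, the caller-supplied generated queries, and the bag-of-words Jaccard similarity (standing in for a dense retriever and an LLM-based query generator) are all assumptions.

```python
# Hypothetical sketch of a Multimodal Chunk-Query Graph (MCQG):
# a bipartite graph linking generated queries to document chunks,
# with query-centric retrieval that aggregates evidence across
# modalities and pages. Names and similarity are illustrative only.
from dataclasses import dataclass
from collections import defaultdict


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    modality: str  # "text", "figure", or "table"
    page: int
    content: str   # paragraph text, caption, or linearized table cells


def similarity(a: str, b: str) -> float:
    """Toy bag-of-words Jaccard similarity standing in for a dense retriever."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


class MCQG:
    """Bipartite graph: generated queries on one side, chunks on the other."""

    def __init__(self) -> None:
        self.query_to_chunks: dict[str, list[Chunk]] = defaultdict(list)

    def add(self, generated_query: str, chunk: Chunk) -> None:
        # "Multimodal document expansion": each chunk is indexed under the
        # fine-grained queries it can answer (supplied by the caller here;
        # generated by a multimodal LLM in the paper's pipeline).
        self.query_to_chunks[generated_query].append(chunk)

    def retrieve(self, user_query: str, top_k: int = 2) -> list[Chunk]:
        # Query-centric retrieval: match the user query against generated
        # queries, then aggregate the chunks linked to the best matches,
        # which may span modalities and pages.
        ranked = sorted(self.query_to_chunks,
                        key=lambda q: similarity(user_query, q),
                        reverse=True)
        evidence, seen = [], set()
        for q in ranked[:top_k]:
            for c in self.query_to_chunks[q]:
                if c.chunk_id not in seen:
                    seen.add(c.chunk_id)
                    evidence.append(c)
        return evidence
```

A single user query can thus pull back a table from one page and a paragraph from another, because both were linked to the same generated query; the aggregated chunks would then be passed to the generator in the RAG stage.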
Problem

Research questions and friction points this paper is trying to address.

multimodal
long-context
document understanding
cross-modal
cross-page reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Retrieval
Long-Context Understanding
Query-Centric Representation
Chunk-Query Graph
Retrieval-Augmented Generation