🤖 AI Summary
DocVQA poses dual challenges: understanding long multimodal documents and performing cross-modal reasoning. Existing DocRAG methods over-rely on textual content and neglect visual cues, and the field lacks reliable benchmarks for multimodal evidence selection and integration. To address this, the authors propose MMDocRAG, a retrieval-augmented QA benchmark for multi-page documents with interleaved text, figures, and tables, comprising 4,055 expert-annotated QA pairs with multi-page, cross-modal evidence chains. The benchmark introduces metrics for evaluating multimodal quote selection and supports answers that interleave text with relevant visual elements. Large-scale evaluation across 60 VLMs/LLMs and 14 retrievers reveals persistent weaknesses in multimodal evidence retrieval, selection, and integration: proprietary models outperform open-source alternatives and gain moderately from multimodal inputs, while open-source models degrade; fine-tuned LLMs improve substantially when given detailed image descriptions.
📝 Abstract
Document Visual Question Answering (DocVQA) faces dual challenges in processing lengthy multimodal documents (text, images, tables) and performing cross-modal reasoning. Current document retrieval-augmented generation (DocRAG) methods remain limited by their text-centric approaches, frequently missing critical visual information. The field also lacks robust benchmarks for assessing multimodal evidence selection and integration. We introduce MMDocRAG, a comprehensive benchmark featuring 4,055 expert-annotated QA pairs with multi-page, cross-modal evidence chains. Our framework introduces innovative metrics for evaluating multimodal quote selection and enables answers that interleave text with relevant visual elements. Through large-scale experiments with 60 VLM/LLM models and 14 retrieval systems, we identify persistent challenges in multimodal evidence retrieval, selection, and integration. Key findings show that advanced proprietary VLMs outperform open-source alternatives and gain moderate advantages from multimodal inputs over text-only inputs, whereas open-source models suffer significant performance degradation with multimodal inputs. Notably, fine-tuned LLMs achieve substantial improvements when using detailed image descriptions. MMDocRAG establishes a rigorous testing ground and provides actionable insights for developing more robust multimodal DocVQA systems. Our benchmark and code are available at https://mmdocrag.github.io/MMDocRAG/.