DMAP: Human-Aligned Structural Document Map for Multimodal Document Understanding

📅 2026-01-26
🤖 AI Summary
This work addresses a limitation of existing multimodal document question answering methods: they often neglect the hierarchical structure and logical-spatial relationships among document elements, which leads to incoherent understanding. To overcome this, the authors propose a structural Document MAP (DMAP), a document-level structural representation aligned with human cognitive processes. A Structured-Semantic Understanding Agent organizes heterogeneous elements such as text, figures, tables, and charts into a hierarchical graph, and a Reflective Reasoning Agent then performs iterative, structure-aware, evidence-driven reasoning over it. Integrated with multimodal retrieval-augmented generation (RAG), the framework substantially improves retrieval accuracy, reasoning consistency, and multimodal comprehension on MMDocQA benchmarks, outperforming conventional RAG approaches.
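To make the idea of a hierarchical document map more concrete, here is a minimal Python sketch of one way such a structure could be represented. It is not taken from the DMAP code base: the `Node` class, its fields, and the helpers `index_nodes` and `section_path` are all hypothetical names chosen for illustration, and the tiny example document is invented.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    """One document element (hypothetical schema, not the paper's)."""
    node_id: str
    kind: str   # e.g. "document", "section", "paragraph", "figure"
    text: str
    children: List["Node"] = field(default_factory=list)
    links: List[str] = field(default_factory=list)  # ids of related elements


def index_nodes(root: Node) -> Dict[str, Tuple[Node, Optional[str]]]:
    """Walk the hierarchy and map node_id -> (node, parent_id)."""
    index: Dict[str, Tuple[Node, Optional[str]]] = {}
    stack: List[Tuple[Node, Optional[str]]] = [(root, None)]
    while stack:
        node, parent_id = stack.pop()
        index[node.node_id] = (node, parent_id)
        for child in node.children:
            stack.append((child, node.node_id))
    return index


def section_path(index: Dict[str, Tuple[Node, Optional[str]]], node_id: str) -> List[str]:
    """Return the ids from the root down to node_id."""
    path: List[str] = []
    current: Optional[str] = node_id
    while current is not None:
        path.append(current)
        current = index[current][1]
    return list(reversed(path))


# Invented toy document: one section holding a paragraph that cites a figure.
doc = Node("doc", "document", "Example report", children=[
    Node("sec1", "section", "1 Results", children=[
        Node("p1", "paragraph", "Accuracy is shown in Figure 1.", links=["fig1"]),
        Node("fig1", "figure", "Figure 1: accuracy by model"),
    ]),
])

index = index_nodes(doc)
path = section_path(index, "p1")
linked = [index[i][0].text for i in index["p1"][0].links]
```

In this sketch, `children` carries the hierarchical organization and `links` carries cross-element relations such as figure-text correspondence, so a retrieved paragraph can be returned together with its section path and the figure it cites instead of as an isolated chunk.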

📝 Abstract
Existing multimodal document question-answering (QA) systems predominantly rely on flat semantic retrieval, representing documents as a set of disconnected text chunks and largely neglecting their intrinsic hierarchical and relational structures. Such flattening disrupts logical and spatial dependencies, such as section organization, figure-text correspondence, and cross-reference relations, that humans naturally exploit for comprehension. To address this limitation, we introduce a document-level structural Document MAP (DMAP), which explicitly encodes both hierarchical organization and inter-element relationships within multimodal documents. Specifically, we design a Structured-Semantic Understanding Agent to construct DMAP by organizing textual content together with figures, tables, charts, etc. into a human-aligned hierarchical schema that captures both semantic and layout dependencies. Building upon this representation, a Reflective Reasoning Agent performs structure-aware and evidence-driven reasoning, dynamically assessing the sufficiency of retrieved context and iteratively refining answers through targeted interactions with DMAP. Extensive experiments on MMDocQA benchmarks demonstrate that DMAP yields document-specific structural representations aligned with human interpretive patterns, substantially enhancing retrieval precision, reasoning consistency, and multimodal comprehension over conventional RAG-based approaches. Code is available at https://github.com/Forlorin/DMAP
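The retrieve, assess, and refine loop described in the abstract can be illustrated with a toy Python sketch. This is an assumption-laden illustration, not the authors' Reflective Reasoning Agent: `keyword_overlap`, `reflective_answer`, the `neighbors` mapping, and the string-matching sufficiency check are all hypothetical stand-ins (in a real system the sufficiency judgment would presumably come from a model), and the example elements are invented.

```python
from typing import Callable, Dict, List


def keyword_overlap(question: str, text: str) -> int:
    """Crude relevance score: number of shared lowercase tokens."""
    return len(set(question.lower().split()) & set(text.lower().split()))


def reflective_answer(
    question: str,
    elements: Dict[str, str],          # element id -> text
    neighbors: Dict[str, List[str]],   # element id -> structurally related ids
    is_sufficient: Callable[[str, List[str]], bool],
    max_rounds: int = 3,
) -> List[str]:
    """Return the ids of the evidence gathered for the question."""
    # Start from flat retrieval: the single best-matching element.
    best = max(elements, key=lambda i: keyword_overlap(question, elements[i]))
    evidence = [best]
    for _ in range(max_rounds):
        # Reflect: stop as soon as the current context is judged sufficient.
        if is_sufficient(question, [elements[i] for i in evidence]):
            break
        # Otherwise follow structural links to pull in related elements.
        added = False
        for i in list(evidence):
            for n in neighbors.get(i, []):
                if n not in evidence:
                    evidence.append(n)
                    added = True
        if not added:
            break  # nothing left to expand
    return evidence


# Invented toy data: a paragraph that points to a table holding the answer.
elements = {
    "p1": "Model accuracy is reported in Table 2.",
    "tab2": "Table 2: ResNet 91.2, ViT 93.5",
    "p2": "We thank the reviewers.",
}
neighbors = {"p1": ["tab2"], "tab2": ["p1"]}


def mentions_vit(question: str, texts: List[str]) -> bool:
    """Toy sufficiency check: some evidence text must mention ViT."""
    return any("ViT" in t for t in texts)


evidence = reflective_answer(
    "What model accuracy is reported for ViT?", elements, neighbors, mentions_vit
)
```

Here flat retrieval alone returns only the paragraph `p1`, which does not contain the number; the sufficiency check fails, and the loop follows the paragraph's structural link to `tab2`, which is the kind of figure-text or table-text dependency that a flat chunk store would lose.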
Problem

Research questions and friction points this paper is trying to address.

multimodal document understanding
document structure
hierarchical organization
relational dependencies
document question answering
Innovation

Methods, ideas, or system contributions that make the work stand out.

structural document representation
hierarchical reasoning
multimodal document understanding
reflective reasoning agent
human-aligned schema
ShunLiang Fu
Nanjing University of Science and Technology
Yanxin Zhang
University of Wisconsin–Madison
Yixin Xiang
Nanjing University of Science and Technology
Xiaoyu Du
Nanjing University of Science and Technology
Multimedia, Recommendation, Parallel Computing
Jinhui Tang
Nanjing Forestry University