🤖 AI Summary
This work addresses key limitations of conventional retrieval-augmented approaches to multimodal long-document question answering, where flat chunking disrupts document structure and cross-modal alignment, and iterative retrieval often suffers from local loops or noise drift. To overcome these challenges, the authors propose a dual-graph collaborative mechanism: a content graph preserves the document's native structure and cross-modal semantics, while a planning graph (a directed acyclic graph of sub-questions) enables proactive multi-hop reasoning with explicit path guidance. The authors present this as the first integration of structure-aware representation with agent-driven reasoning, enabling globally aware multimodal evidence aggregation. Evaluated on VisDoMBench across five diverse multimodal domains, the method achieves an average accuracy of 66.21%, substantially outperforming strong baselines and a standalone GPT-5 (53.08%).
📝 Abstract
Retrieval-augmented generation is a practical paradigm for question answering over long documents, but it remains brittle for multimodal reading where text, tables, and figures are interleaved across many pages. First, flat chunking breaks document-native structure and cross-modal alignment, yielding semantic fragments that are hard to interpret in isolation. Second, even iterative retrieval can fail in long contexts by looping on partial evidence or drifting into irrelevant sections as noise accumulates, since each step is guided only by the current snippet without a persistent global search state. We introduce $G^2$-Reader, a dual-graph system, to address both issues. It evolves a Content Graph to preserve document-native structure and cross-modal semantics, and maintains a Planning Graph, an agentic directed acyclic graph of sub-questions, to track intermediate findings and guide stepwise navigation for evidence completion. On VisDoMBench across five multimodal domains, $G^2$-Reader with Qwen3-VL-32B-Instruct reaches 66.21\% average accuracy, outperforming strong baselines and a standalone GPT-5 (53.08\%).
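The Planning Graph described above can be pictured as a small DAG bookkeeping structure: sub-questions with dependencies, findings recorded as evidence arrives, and a "frontier" of sub-questions that become answerable once their dependencies are resolved. The sketch below is a minimal, hypothetical illustration of that idea only; the class and method names (`PlanningGraph`, `frontier`, `record_finding`) are our own and do not come from the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class SubQuestion:
    qid: str        # identifier of the sub-question
    text: str       # natural-language sub-question
    deps: tuple = ()        # qids this sub-question depends on
    finding: object = None  # evidence/answer once resolved

class PlanningGraph:
    """Toy DAG of sub-questions: records intermediate findings and
    surfaces the next sub-questions whose dependencies are resolved."""

    def __init__(self):
        self.nodes = {}

    def add(self, qid, text, deps=()):
        self.nodes[qid] = SubQuestion(qid, text, tuple(deps))

    def record_finding(self, qid, finding):
        self.nodes[qid].finding = finding

    def frontier(self):
        """Unresolved sub-questions whose dependencies all have findings."""
        return [n for n in self.nodes.values()
                if n.finding is None
                and all(self.nodes[d].finding is not None for d in n.deps)]

    def resolved(self):
        return all(n.finding is not None for n in self.nodes.values())

# Hypothetical decomposition of a multimodal question:
pg = PlanningGraph()
pg.add("q1", "Which table reports 2023 revenue?")
pg.add("q2", "Which figure shows the revenue trend?")
pg.add("q3", "Combine table and figure to answer.", deps=("q1", "q2"))

first_steps = sorted(n.qid for n in pg.frontier())  # ['q1', 'q2']
pg.record_finding("q1", "Table 3, p. 12")
pg.record_finding("q2", "Figure 5, p. 14")
next_step = [n.qid for n in pg.frontier()]          # ['q3']
```

The point of the explicit graph state is exactly what the abstract motivates: each retrieval step consults a persistent global plan rather than only the current snippet, so the agent neither loops on partial evidence nor drifts once some sub-questions are settled.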