🤖 AI Summary
Existing DocVQA methods show clear limitations in explicit spatial relation modeling, computational efficiency on high-resolution documents, multi-hop reasoning, and decision interpretability. To address these challenges, we propose a multimodal framework that integrates text encoding, spatial graph reasoning, and memory augmentation. First, we construct a token-level spatial graph to explicitly model fine-grained layout relationships. Second, we introduce a question-guided compression mechanism coupled with structured memory access, enabling efficient processing of high-resolution documents and controllable knowledge retrieval. Third, we generate interpretable decision paths that support visualization of the reasoning process and joint optimization of answer prediction and spatial localization. Evaluated on six mainstream benchmarks, our method achieves significant improvements in both accuracy and inference efficiency and, notably, is the first to deliver consistent gains in both answer prediction and spatial localization.
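As a rough illustration of the token-level spatial graph idea, one common construction links each OCR token to its spatially nearest neighbors so a graph module can reason over layout. The sketch below is hypothetical (the paper's exact edge criteria are not given here): token boxes, the `Token` class, and the k-nearest-neighbor rule are illustrative assumptions.

```python
# Hypothetical sketch: build a token-level spatial graph from OCR boxes.
# Each token carries a bounding box (x0, y0, x1, y1); edges connect tokens
# whose box centers are close, approximating layout relationships.
from dataclasses import dataclass
import math

@dataclass
class Token:
    text: str
    box: tuple  # (x0, y0, x1, y1) in page coordinates

def center(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)

def build_spatial_graph(tokens, k=2):
    """Connect each token to its k nearest neighbors by box-center distance."""
    centers = [center(t.box) for t in tokens]
    edges = set()
    for i, (cx, cy) in enumerate(centers):
        dists = sorted(
            (math.hypot(cx - ox, cy - oy), j)
            for j, (ox, oy) in enumerate(centers) if j != i
        )
        for _, j in dists[:k]:
            edges.add((min(i, j), max(i, j)))  # undirected edge
    return sorted(edges)

tokens = [
    Token("Invoice", (0, 0, 60, 10)),
    Token("No.", (65, 0, 85, 10)),
    Token("Total", (0, 100, 40, 110)),
    Token("$42.00", (50, 100, 100, 110)),
]
print(build_spatial_graph(tokens, k=1))  # → [(0, 1), (2, 3)]
```

With `k=1`, header tokens on the same line pair up and the total label links to its amount, which is the kind of fine-grained layout relation a spatial graph reasoner would then propagate over.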
📝 Abstract
Document Visual Question Answering (DocVQA) requires models to jointly understand textual semantics, spatial layout, and visual features. Current methods struggle with explicit spatial relationship modeling, are inefficient on high-resolution documents, and offer limited multi-hop reasoning and interpretability. We propose MGA-VQA, a multimodal framework that integrates token-level encoding, spatial graph reasoning, memory-augmented inference, and question-guided compression. Unlike prior black-box models, MGA-VQA introduces interpretable graph-based decision pathways and structured memory access for greater reasoning transparency. Evaluation across six benchmarks (FUNSD, CORD, SROIE, DocVQA, STE-VQA, and RICO) demonstrates superior accuracy and efficiency, with consistent improvements in both answer prediction and spatial localization.