🤖 AI Summary
Existing DocVQA methods show clear limitations in explicit spatial relation modeling, computational efficiency on high-resolution documents, multi-hop reasoning, and decision interpretability. To address these challenges, we propose a multimodal framework that integrates text encoding, spatial graph reasoning, and memory augmentation. First, we construct a token-level spatial graph to explicitly model fine-grained layout relationships. Second, we introduce a question-guided compression mechanism coupled with structured memory access, enabling efficient processing of high-resolution documents and controllable knowledge retrieval. Third, we generate interpretable decision paths that support visualization of the reasoning process and joint optimization of answer prediction and spatial localization. Evaluated on six mainstream benchmarks, our method achieves significant improvements in both accuracy and inference efficiency and, notably, is the first to deliver consistent gains in both answer prediction and spatial localization.
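As a rough illustration of the token-level spatial graph idea, one common construction links each OCR token to its spatially nearest neighbors so a graph module can reason over layout. The sketch below is hypothetical (the paper's exact edge criteria are not given here): token boxes, the `Token` class, and the k-nearest-neighbor rule are illustrative assumptions.

```python
# Hypothetical sketch: build a token-level spatial graph from OCR boxes.
# Each token carries a bounding box (x0, y0, x1, y1); edges connect tokens
# whose box centers are close, approximating layout relationships.
from dataclasses import dataclass
import math

@dataclass
class Token:
    text: str
    box: tuple  # (x0, y0, x1, y1) in page coordinates

def center(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)

def build_spatial_graph(tokens, k=2):
    """Connect each token to its k nearest neighbors by box-center distance."""
    centers = [center(t.box) for t in tokens]
    edges = set()
    for i, (cx, cy) in enumerate(centers):
        dists = sorted(
            (math.hypot(cx - ox, cy - oy), j)
            for j, (ox, oy) in enumerate(centers) if j != i
        )
        for _, j in dists[:k]:
            edges.add((min(i, j), max(i, j)))  # undirected edge
    return sorted(edges)

tokens = [
    Token("Invoice", (0, 0, 60, 10)),
    Token("No.", (65, 0, 85, 10)),
    Token("Total", (0, 100, 40, 110)),
    Token("$42.00", (50, 100, 100, 110)),
]
print(build_spatial_graph(tokens, k=1))  # → [(0, 1), (2, 3)]
```

With `k=1`, header tokens on the same line pair up and the total label links to its amount, which is the kind of fine-grained layout relation a spatial graph reasoner would then propagate over.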
📝 Abstract
Document Visual Question Answering (DocVQA) requires models to jointly understand textual semantics, spatial layout, and visual features. Current methods struggle with explicit spatial relationship modeling, are inefficient on high-resolution documents, and offer limited multi-hop reasoning and interpretability. We propose MGA-VQA, a multimodal framework that integrates token-level encoding, spatial graph reasoning, memory-augmented inference, and question-guided compression. Unlike prior black-box models, MGA-VQA introduces interpretable graph-based decision pathways and structured memory access for greater reasoning transparency. Evaluation across six benchmarks (FUNSD, CORD, SROIE, DocVQA, STE-VQA, and RICO) demonstrates superior accuracy and efficiency, with consistent improvements in both answer prediction and spatial localization.