DocVXQA: Context-Aware Visual Explanations for Document Question Answering

📅 2025-05-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses Document Visual Question Answering (DocVQA) by proposing a visually self-explanatory model that jointly outputs accurate answers and context-aware, differentiable attention heatmaps, intuitively highlighting the key textual and visual regions underpinning each decision. To balance explanatory fidelity and predictive performance, the authors introduce a dual-objective, quantifiable explanation-learning paradigm grounded in *context sufficiency* and *representation efficiency*, moving beyond conventional answer-localization-only paradigms. Methodologically, the approach integrates multimodal feature alignment, a differentiable heatmap-generation module, and an information-bottleneck-based explanation regularization loss. The model achieves state-of-the-art accuracy across multiple DocVQA benchmarks while producing heatmaps rated significantly higher in human evaluation than those of baseline methods. The implementation is publicly available.
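The dual-objective formulation described above can be illustrated as a single training loss combining answer accuracy with sufficiency and efficiency terms on the heatmap. The sketch below is a minimal, hypothetical interpretation of that idea (the function name, the coverage floor, and the specific penalty forms are assumptions for illustration, not the authors' implementation, which uses a differentiable heatmap module and an information-bottleneck loss inside a full multimodal network):

```python
import numpy as np

def dual_objective_loss(answer_logits, answer_target, heatmap,
                        lambda_suff=1.0, lambda_eff=0.1):
    """Hypothetical sketch: task loss + context sufficiency + representation
    efficiency. NOT the DocVXQA authors' actual implementation."""
    # Task term: cross-entropy on the predicted answer class.
    probs = np.exp(answer_logits - answer_logits.max())
    probs /= probs.sum()
    task_loss = -np.log(probs[answer_target] + 1e-12)

    # Sufficiency term (assumed form): penalize heatmaps whose total
    # coverage drops below a floor, i.e. explanations that keep too
    # little context to support the answer.
    coverage = heatmap.mean()
    suff_loss = max(0.0, 0.2 - coverage)

    # Efficiency term (assumed form): an information-bottleneck-style
    # sparsity penalty keeping the highlighted region compact.
    eff_loss = np.abs(heatmap).mean()

    return task_loss + lambda_suff * suff_loss + lambda_eff * eff_loss
```

Under this toy formulation, a heatmap that highlights everything pays the efficiency penalty, while one that highlights almost nothing pays the sufficiency penalty, so the minimum sits at a compact-but-sufficient explanation, which is the trade-off the summary describes.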

📝 Abstract
We propose DocVXQA, a novel framework for visually self-explainable document question answering. The framework is designed not only to produce accurate answers to questions but also to learn visual heatmaps that highlight contextually critical regions, thereby offering interpretable justifications for the model's decisions. To integrate explanations into the learning process, we quantitatively formulate explainability principles as explicit learning objectives. Unlike conventional methods that emphasize only the regions pertinent to the answer, our framework delivers explanations that are *contextually sufficient* while remaining *representation-efficient*. This fosters user trust while achieving a balance between predictive performance and interpretability in DocVQA applications. Extensive experiments, including human evaluation, provide strong evidence supporting the effectiveness of our method. The code is available at https://github.com/dali92002/DocVXQA.
Problem

Research questions and friction points this paper is trying to address.

Develops visually explainable document question answering framework
Integrates context-aware visual heatmaps for interpretable justifications
Balances predictive performance and interpretability in DocVQA
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates visual heatmaps for document QA
Quantifies explainability as learning objectives
Ensures contextually sufficient and efficient explanations