🤖 AI Summary
This work addresses Document Visual Question Answering (DocVQA) with a visually self-explanatory model that jointly outputs accurate answers and context-aware, differentiable attention heatmaps, intuitively highlighting the textual and visual regions that underpin each decision. To balance explanatory fidelity and predictive performance, the authors introduce a dual-objective, quantifiable explanation-learning paradigm grounded in *context sufficiency* and *representation efficiency*, moving beyond conventional answer-localization-only approaches. Methodologically, the framework integrates multimodal feature alignment, a differentiable heatmap-generation module, and an information-bottleneck-based explanation regularization loss. The model achieves state-of-the-art accuracy across multiple DocVQA benchmarks while producing heatmaps rated significantly higher in human evaluation than those of baseline methods. The implementation is publicly available.
📝 Abstract
We propose DocVXQA, a novel framework for visually self-explainable document question answering. The framework is designed not only to produce accurate answers to questions but also to learn visual heatmaps that highlight contextually critical regions, thereby offering interpretable justifications for the model's decisions. To integrate explanations into the learning process, we quantitatively formulate explainability principles as explicit learning objectives. Unlike conventional methods that emphasize only the regions pertinent to the answer, our framework delivers explanations that are *contextually sufficient* while remaining *representation-efficient*. This fosters user trust while achieving a balance between predictive performance and interpretability in DocVQA applications. Extensive experiments, including human evaluation, provide strong evidence supporting the effectiveness of our method. The code is available at https://github.com/dali92002/DocVXQA.
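The dual objectives described above (answer accuracy, context sufficiency, and representation efficiency) can be illustrated with a minimal loss sketch. This is a hypothetical simplification, not the paper's actual implementation: `explanation_loss`, its arguments, and the weights `lambda_suff`/`lambda_eff` are illustrative names, and sufficiency is approximated here as a KL term between predictions from the full input and from the heatmap-masked input, with efficiency as a sparsity penalty on the heatmap.

```python
import torch
import torch.nn.functional as F

def explanation_loss(answer_logits, answer_target, masked_logits, heatmap,
                     lambda_suff=1.0, lambda_eff=0.1):
    """Hypothetical dual-objective loss combining answer accuracy with
    sufficiency and efficiency regularizers on the explanation heatmap."""
    # (1) Task objective: standard cross-entropy on the predicted answer.
    task = F.cross_entropy(answer_logits, answer_target)

    # (2) Context sufficiency: the answer distribution computed from the
    #     heatmap-masked input should match the full-input distribution,
    #     i.e. the highlighted regions alone suffice to answer.
    suff = F.kl_div(F.log_softmax(masked_logits, dim=-1),
                    F.softmax(answer_logits, dim=-1),
                    reduction="batchmean")

    # (3) Representation efficiency: penalize total heatmap mass so the
    #     explanation keeps only what is needed (a bottleneck-style
    #     sparsity prior on the retained regions).
    eff = heatmap.mean()

    return task + lambda_suff * suff + lambda_eff * eff
```

In a real training loop the masked logits would come from a second forward pass over the document with non-highlighted regions suppressed; here they are simply passed in to keep the sketch self-contained.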