Towards Self-Explainable Document Visual Question Answering with Chain-of-Explanation Predictions

📅 2026-05-07

🤖 AI Summary
Current document visual question answering (DocVQA) models entangle question-relevant evidence with answer localization, and their reasoning is largely opaque and hard to verify. This work proposes CoExVQA, a self-explainable framework whose chain-of-explanation design splits prediction into three stages: evidence identification, answer localization, and answer generation. The model first identifies the evidence relevant to the question, then explicitly localizes the answer region on the page, and finally decodes the answer exclusively from that grounded region. Each intermediate output can therefore be inspected and checked across modalities. On the PFL-DocVQA benchmark, CoExVQA reports state-of-the-art performance among explainable DocVQA methods, improving ANLS by 12% over current explainable baselines while keeping the prediction process transparent and verifiable.
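The three stages can be pictured with a toy sketch. This is not CoExVQA's implementation: `Token`, `inside`, and `chain_of_explanation` are hypothetical names, the two regions are passed in by hand where a real model would predict them, and "decoding from the grounded region" is approximated by simply keeping the OCR tokens that fall inside the answer box.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), hypothetical layout


@dataclass(frozen=True)
class Token:
    """One OCR token with its bounding box on the page (illustrative type)."""
    text: str
    box: Box


def inside(box: Box, region: Box) -> bool:
    """True if `box` lies entirely within `region`."""
    return (box[0] >= region[0] and box[1] >= region[1]
            and box[2] <= region[2] and box[3] <= region[3])


def chain_of_explanation(tokens: List[Token],
                         evidence_region: Box,
                         answer_region: Box) -> Dict[str, object]:
    """Toy chain: evidence -> answer region -> answer.

    Assumption: both regions stand in for model predictions, and the
    'decoder' is a plain filter over tokens inside the answer region.
    """
    # Stage 1: question-relevant evidence (here: tokens in the evidence box).
    evidence = [t.text for t in tokens if inside(t.box, evidence_region)]
    # Stage 2: the answer region is exposed as an inspectable output.
    # Stage 3: the answer may only use tokens grounded in that region.
    answer = " ".join(t.text for t in tokens if inside(t.box, answer_region))
    return {"evidence": evidence, "answer_region": answer_region, "answer": answer}


page = [
    Token("Invoice", (0, 0, 50, 10)),
    Token("Total:", (0, 20, 40, 30)),
    Token("$42", (45, 20, 70, 30)),
]
result = chain_of_explanation(page,
                              evidence_region=(0, 20, 80, 30),
                              answer_region=(44, 19, 71, 31))
print(result)
```

Because the answer is built only from tokens inside the returned region, a reader can check each stage separately: whether the evidence is relevant, whether the box is in the right place, and whether the answer matches what the box contains.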
📝 Abstract
Document Visual Question Answering (DocVQA) requires vision-language models to reason not only about what information in a document is relevant to a question, but also where the answer is grounded on the page. Existing DocVQA models entangle question-relevant evidence and answer localization and operate largely as black boxes, offering limited means to verify how predictions depend on visual evidence. We propose CoExVQA, a self-explainable DocVQA framework with a grounded reasoning process through a chain-of-explanation design. CoExVQA first identifies question-relevant evidence, then explicitly localizes the answer region, and finally decodes the answer exclusively from the grounded region. Prediction via CoExVQA's chain-of-explanation enables direct inspection and verification of the reasoning process across modalities. Empirical results show that restricting decoding to grounded evidence achieves SotA explainable DocVQA performance on PFL-DocVQA, improving ANLS by 12% over the current explainable baselines while providing transparent and verifiable predictions.
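For readers unfamiliar with the metric, ANLS (Average Normalized Levenshtein Similarity) scores each prediction by its edit-distance similarity to the closest reference answer and zeroes out scores past a threshold. The sketch below follows the commonly used definition with a threshold of 0.5; the lowercasing and whitespace stripping, and the function names `levenshtein` and `anls`, are assumptions of this example rather than details taken from the paper or the PFL-DocVQA evaluation code.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def anls(references: Sequence[Sequence[str]],
         predictions: Sequence[str],
         threshold: float = 0.5) -> float:
    """Average over questions of the best per-reference similarity.

    Similarity is 1 - NL when the normalized distance NL is below
    `threshold`, and 0 otherwise. Normalization here (lowercase, strip)
    is an assumption of this sketch.
    """
    scores: List[float] = []
    for refs, pred in zip(references, predictions):
        p = pred.strip().lower()
        best = 0.0
        for ref in refs:
            r = ref.strip().lower()
            longest = max(len(r), len(p))
            nl = levenshtein(r, p) / longest if longest else 0.0
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0


example = anls([["hello"], ["42", "forty-two"]], ["hallo", "42"])
print(example)
```

In the usage line, "hallo" is one substitution away from "hello" and "42" matches a reference exactly, so the two per-question scores are averaged into a single value between 0 and 1.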
Problem

Research questions and friction points this paper is trying to address.

Document Visual Question Answering
explainability
visual evidence
answer grounding
black-box models
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-explainable
chain-of-explanation
grounded reasoning
DocVQA
visual question answering
Authors
Kjetil Indrehus
Department of Informatics, University of Oslo; Integreat – Norwegian Centre for Knowledge-driven Machine Learning; TRUST – The Norwegian Centre for Trustworthy AI
Adrian Duric
Department of Informatics, University of Oslo; Integreat – Norwegian Centre for Knowledge-driven Machine Learning; TRUST – The Norwegian Centre for Trustworthy AI
Changkyu Choi
Department of Informatics, University of Oslo
Ali Ramezani-Kebrya
Associate Professor (with tenure) in CS at the University of Oslo
Machine Learning