🤖 AI Summary
Current document visual question answering (DocVQA) models entangle question-relevant evidence with answer localization, and their reasoning is hard to inspect or verify. This work proposes CoExVQA, a self-explainable framework whose chain-of-explanation design decouples prediction into three stages: evidence identification, answer localization, and answer generation. The model first identifies the evidence relevant to the question, then explicitly localizes the answer region on the page, and finally decodes the answer exclusively from that grounded region. Because each stage produces an intermediate output, the reasoning process can be inspected and verified across modalities. On the PFL-DocVQA benchmark, CoExVQA achieves state-of-the-art performance among explainable DocVQA methods, improving ANLS by 12% over current explainable baselines while keeping predictions transparent and verifiable.
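The three stages described above can be illustrated with a toy sketch. This is not the CoExVQA implementation: the `Token` type, the word-overlap evidence rule, the "region to the right of the evidence" heuristic, and all function names are hypothetical stand-ins chosen only to show how an answer can be restricted to a grounded region while every intermediate output stays inspectable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical OCR token: text plus an (x0, y0, x1, y1) bounding box."""
    text: str
    box: tuple


def _norm(word):
    # Lowercase and drop surrounding punctuation for a crude match.
    return word.strip("?.,:").lower()


def identify_evidence(question, tokens):
    # Stage 1 (toy stand-in): keep tokens whose text also occurs in the question.
    q_words = {_norm(w) for w in question.split()}
    return [t for t in tokens if _norm(t.text) in q_words]


def localize_answer(evidence, page_width):
    # Stage 2 (toy heuristic, an assumption): the answer region is the strip
    # to the right of the evidence tokens, spanning their vertical extent.
    if not evidence:
        return None
    x0 = max(t.box[2] for t in evidence)
    y0 = min(t.box[1] for t in evidence)
    y1 = max(t.box[3] for t in evidence)
    return (x0, y0, page_width, y1)


def decode_from_region(region, tokens):
    # Stage 3: only tokens fully inside the grounded region may form the answer.
    if region is None:
        return ""
    x0, y0, x1, y1 = region
    kept = [
        t for t in tokens
        if t.box[0] >= x0 and t.box[1] >= y0 and t.box[2] <= x1 and t.box[3] <= y1
    ]
    return " ".join(t.text for t in kept)


def answer(question, tokens, page_width):
    # Return every intermediate output so the chain can be inspected.
    evidence = identify_evidence(question, tokens)
    region = localize_answer(evidence, page_width)
    return {
        "evidence": [t.text for t in evidence],
        "region": region,
        "answer": decode_from_region(region, tokens),
    }


PAGE = [
    Token("Invoice", (0, 0, 50, 10)),
    Token("Date:", (55, 0, 90, 10)),
    Token("2021-03-04", (95, 0, 170, 10)),
    Token("Total:", (0, 20, 40, 30)),
    Token("42.00", (45, 20, 80, 30)),
]

result = answer("What is the total?", PAGE, page_width=200)
print(result)
```

In this sketch the answer can never contain a token outside the localized region, so an incorrect answer can be traced back to either the evidence step or the localization step by reading the returned dictionary.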
📝 Abstract
Document Visual Question Answering (DocVQA) requires vision-language models to reason not only about what information in a document is relevant to a question, but also where the answer is grounded on the page. Existing DocVQA models entangle question-relevant evidence and answer localization and operate largely as black boxes, offering limited means to verify how predictions depend on visual evidence. We propose CoExVQA, a self-explainable DocVQA framework with a grounded reasoning process through a chain-of-explanation design. CoExVQA first identifies question-relevant evidence, then explicitly localizes the answer region, and finally decodes the answer exclusively from the grounded region. Predicting through this chain of explanation enables direct inspection and verification of the reasoning process across modalities. Empirical results show that restricting decoding to grounded evidence achieves state-of-the-art explainable DocVQA performance on PFL-DocVQA, improving ANLS by 12% over current explainable baselines while providing transparent and verifiable predictions.
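For readers unfamiliar with the reported metric, the following is a minimal sketch of ANLS (Average Normalized Levenshtein Similarity) as it is commonly defined for DocVQA benchmarks. The threshold of 0.5 and the case-insensitive, whitespace-stripped comparison are the usual conventions and are assumed here; the source does not state the exact evaluation settings, and the function names are illustrative.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]


def anls(predictions, references, tau=0.5):
    """Mean over questions of the best similarity against any reference answer.

    predictions: list of predicted strings.
    references:  list of lists of acceptable answers, one list per question.
    tau:         assumed threshold; normalized distances >= tau score 0.
    """
    scores = []
    for pred, golds in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for gold in golds:
            g = gold.strip().lower()
            denom = max(len(p), len(g))
            nl = levenshtein(p, g) / denom if denom else 0.0
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0
```

For example, `anls(["42.0"], [["42.00"]])` gives partial credit because the prediction is one edit away from a five-character reference, while a prediction that shares no characters with any reference scores zero.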