ProtoVQA: An Adaptable Prototypical Framework for Explainable Fine-Grained Visual Question Answering

📅 2025-09-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the critical need for trustworthy explanations in vision-language applications such as medical diagnosis and autonomous driving, this paper proposes ProtoVQA, the first unified, interpretable VQA framework grounded in question-aware visual prototypes. ProtoVQA employs a shared prototype backbone to jointly learn visual prototypes aligned with question semantics; these prototypes serve as reasoning anchors that tightly couple answer generation with discriminative region localization. A spatially constrained matching mechanism enforces spatial coherence and semantic consistency in the explanations. The paper further introduces the Visual-Linguistic Alignment Score (VLAS) to quantitatively assess the faithfulness and granularity of explanations. Experiments on Visual7W show that ProtoVQA achieves state-of-the-art accuracy while substantially improving explanation comprehensibility, faithfulness, and model transparency, establishing a new paradigm for explainable VQA modeling.

📝 Abstract
Visual Question Answering (VQA) is increasingly used in diverse applications ranging from general visual reasoning to safety-critical domains such as medical imaging and autonomous systems, where models must provide not only accurate answers but also explanations that humans can easily understand and verify. Prototype-based modeling has shown promise for interpretability by grounding predictions in semantically meaningful regions for purely visual reasoning tasks, yet remains underexplored in the context of VQA. We present ProtoVQA, a unified prototypical framework that (i) learns question-aware prototypes that serve as reasoning anchors, connecting answers to discriminative image regions, (ii) applies spatially constrained matching to ensure that the selected evidence is coherent and semantically relevant, and (iii) supports both answering and grounding tasks through a shared prototype backbone. To assess explanation quality, we propose the Visual-Linguistic Alignment Score (VLAS), which measures how well the model's attended regions align with ground-truth evidence. Experiments on Visual7W show that ProtoVQA yields faithful, fine-grained explanations while maintaining competitive accuracy, advancing the development of transparent and trustworthy VQA systems.
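The abstract describes VLAS as measuring how well the model's attended regions align with ground-truth evidence, but does not publish the exact formula. One simple way such an alignment score could be instantiated is an intersection-over-union between a thresholded attention map and a ground-truth evidence mask; the sketch below is an illustrative stand-in, not the authors' definition.

```python
import numpy as np

def vlas_iou(attention, gt_mask, threshold=0.5):
    """Hypothetical alignment score: IoU between the model's
    thresholded attention map and a ground-truth evidence mask.
    The paper does not give the exact VLAS formula; this is an
    illustrative stand-in, not the authors' definition."""
    # Normalize attention to [0, 1], then binarize at the threshold.
    att = attention / (attention.max() + 1e-8)
    pred = att >= threshold
    gt = gt_mask.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 0.0
```

A score of 1.0 would mean the attended region exactly covers the annotated evidence; any real metric would likely also account for attention mass, not just binary overlap.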
Problem

Research questions and friction points this paper is trying to address.

Developing explainable VQA systems for safety-critical applications
Creating question-aware prototypes for visual-linguistic reasoning
Ensuring faithful explanations through spatially constrained matching
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learns question-aware prototypes as reasoning anchors
Applies spatially constrained matching for coherent evidence
Uses shared prototype backbone for answering and grounding