Believing without Seeing: Quality Scores for Contextualizing Vision-Language Model Explanations

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Blind and low-vision users often accept erroneous vision-language model (VLM) predictions uncritically because persuasive natural-language explanations stand in for visual feedback they cannot check. Method: The paper proposes an accessibility-oriented framework for assessing explanation quality. It introduces two computationally tractable scores, Visual Fidelity and Contrastiveness, computed via vision-language semantic alignment and comparison of the model's prediction against plausible alternatives, without requiring human annotations. Contribution/Results: On the A-OKVQA and VizWiz benchmarks, the scores are better calibrated with model correctness than existing explanation qualities, allowing users to calibrate their trust in model predictions. A user study shows that presenting the quality scores alongside VLM explanations improves participants' accuracy at judging prediction correctness by 11.1% and reduces false belief in incorrect predictions by 15.4%, strengthening human-AI collaborative decision-making in accessible interfaces.
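As a rough illustration of the kind of vision-language alignment scoring the summary describes, the sketch below approximates a Visual Fidelity style score with off-the-shelf CLIP embeddings. The model choice, function name, and cosine-similarity formulation are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch (not the paper's implementation): approximate a
# "Visual Fidelity"-style score as CLIP image-text cosine similarity.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def visual_fidelity(image_path: str, explanation: str) -> float:
    """Cosine similarity between the image and the explanation text."""
    image = Image.open(image_path)
    inputs = processor(text=[explanation], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return float((img_emb @ txt_emb.T).item())

# Example: a higher score suggests the explanation describes what is actually
# in the image; a low score flags explanations that may be unfaithful.
# print(visual_fidelity("photo.jpg", "The sign in the photo says 'Exit'."))
```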

📝 Abstract
When people query Vision-Language Models (VLMs) but cannot see the accompanying visual context (e.g. for blind and low-vision users), augmenting VLM predictions with natural language explanations can signal which model predictions are reliable. However, prior work has found that explanations can easily convince users that inaccurate VLM predictions are correct. To remedy undesirable overreliance on VLM predictions, we propose evaluating two complementary qualities of VLM-generated explanations via two quality scoring functions. We propose Visual Fidelity, which captures how faithful an explanation is to the visual context, and Contrastiveness, which captures how well the explanation identifies visual details that distinguish the model's prediction from plausible alternatives. On the A-OKVQA and VizWiz tasks, these quality scoring functions are better calibrated with model correctness than existing explanation qualities. We conduct a user study in which participants have to decide whether a VLM prediction is accurate without viewing its visual context. We observe that showing our quality scores alongside VLM explanations improves participants' accuracy at predicting VLM correctness by 11.1%, including a 15.4% reduction in the rate of falsely believing incorrect predictions. These findings highlight the utility of explanation quality scores in fostering appropriate reliance on VLM predictions.
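To make the Contrastiveness idea concrete, here is a minimal sketch assuming sentence-embedding similarity as a stand-in for the paper's scoring function; the margin formulation, model choice, and all names are illustrative assumptions rather than the authors' method.

```python
# Hypothetical sketch (not the paper's implementation): a contrastiveness-style
# margin -- how much more the explanation supports the predicted answer than
# the most-supported alternative answer.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def contrastiveness(explanation: str, prediction: str, alternatives: list[str]) -> float:
    """Similarity margin of explanation vs. prediction over the best alternative."""
    emb = encoder.encode([explanation, prediction] + alternatives,
                         convert_to_tensor=True, normalize_embeddings=True)
    expl, pred, alts = emb[0], emb[1], emb[2:]
    sim_pred = float(util.cos_sim(expl, pred))
    sim_alts = float(util.cos_sim(expl, alts).max())
    return sim_pred - sim_alts

# Example: an explanation citing details unique to the predicted answer should
# score higher than one that applies equally well to the alternatives.
# print(contrastiveness("The dog has long golden fur and floppy ears.",
#                       "golden retriever", ["labrador", "poodle", "beagle"]))
```

A higher margin suggests the explanation points to visual details that favor the prediction specifically, rather than details compatible with any plausible answer.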
Problem

Research questions and friction points this paper is trying to address.

Evaluating visual fidelity of VLM explanations for reliability
Measuring contrastiveness to distinguish predictions from alternatives
Reducing user overreliance on incorrect VLM predictions via quality scores
Innovation

Methods, ideas, or system contributions that make the work stand out.

Quality scores measure Visual Fidelity and Contrastiveness
Scores improve user accuracy without visual context
Scores reduce overreliance on incorrect VLM predictions
🔎 Similar Papers
No similar papers found.