Believing without Seeing: Quality Scores for Contextualizing Vision-Language Model Explanations

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Blind and low-vision users often accept erroneous vision-language model (VLM) predictions uncritically because persuasive natural-language explanations stand in for visual feedback they cannot check. Method: The paper proposes an accessibility-oriented framework for assessing explanation quality. It introduces two computationally tractable scores, Visual Fidelity and Contrastiveness, computed via vision-language semantic alignment and comparison of the model's prediction against plausible alternatives, without requiring human annotations. Contribution/Results: On the A-OKVQA and VizWiz benchmarks, the scores are better calibrated with model correctness than existing explanation qualities, allowing users to calibrate their trust in model predictions. A user study shows that presenting the quality scores alongside VLM explanations improves participants' accuracy at judging prediction correctness by 11.1% and reduces false belief in incorrect predictions by 15.4%, strengthening human-AI collaborative decision-making in accessible interfaces.
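As a rough illustration of the kind of vision-language alignment scoring the summary describes, the sketch below approximates a Visual Fidelity style score with off-the-shelf CLIP embeddings. The model choice, function name, and cosine-similarity formulation are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch (not the paper's implementation): approximate a
# "Visual Fidelity"-style score as CLIP image-text cosine similarity.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def visual_fidelity(image_path: str, explanation: str) -> float:
    """Cosine similarity between the image and the explanation text."""
    image = Image.open(image_path)
    inputs = processor(text=[explanation], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return float((img_emb @ txt_emb.T).item())

# Example: a higher score suggests the explanation describes what is actually
# in the image; a low score flags explanations that may be unfaithful.
# print(visual_fidelity("photo.jpg", "The sign in the photo says 'Exit'."))
```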

📝 Abstract
When people query Vision-Language Models (VLMs) but cannot see the accompanying visual context (e.g. for blind and low-vision users), augmenting VLM predictions with natural language explanations can signal which model predictions are reliable. However, prior work has found that explanations can easily convince users that inaccurate VLM predictions are correct. To remedy undesirable overreliance on VLM predictions, we propose evaluating two complementary qualities of VLM-generated explanations via two quality scoring functions. We propose Visual Fidelity, which captures how faithful an explanation is to the visual context, and Contrastiveness, which captures how well the explanation identifies visual details that distinguish the model's prediction from plausible alternatives. On the A-OKVQA and VizWiz tasks, these quality scoring functions are better calibrated with model correctness than existing explanation qualities. We conduct a user study in which participants have to decide whether a VLM prediction is accurate without viewing its visual context. We observe that showing our quality scores alongside VLM explanations improves participants' accuracy at predicting VLM correctness by 11.1%, including a 15.4% reduction in the rate of falsely believing incorrect predictions. These findings highlight the utility of explanation quality scores in fostering appropriate reliance on VLM predictions.
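To make the Contrastiveness idea concrete, here is a minimal sketch assuming sentence-embedding similarity as a stand-in for the paper's scoring function; the margin formulation, model choice, and all names are illustrative assumptions rather than the authors' method.

```python
# Hypothetical sketch (not the paper's implementation): a contrastiveness-style
# margin -- how much more the explanation supports the predicted answer than
# the most-supported alternative answer.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def contrastiveness(explanation: str, prediction: str, alternatives: list[str]) -> float:
    """Similarity margin of explanation vs. prediction over the best alternative."""
    emb = encoder.encode([explanation, prediction] + alternatives,
                         convert_to_tensor=True, normalize_embeddings=True)
    expl, pred, alts = emb[0], emb[1], emb[2:]
    sim_pred = float(util.cos_sim(expl, pred))
    sim_alts = float(util.cos_sim(expl, alts).max())
    return sim_pred - sim_alts

# Example: an explanation citing details unique to the predicted answer should
# score higher than one that applies equally well to the alternatives.
# print(contrastiveness("The dog has long golden fur and floppy ears.",
#                       "golden retriever", ["labrador", "poodle", "beagle"]))
```

A higher margin suggests the explanation points to visual details that favor the prediction specifically, rather than details compatible with any plausible answer.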
Problem

Research questions and friction points this paper is trying to address.

Evaluating visual fidelity of VLM explanations for reliability
Measuring contrastiveness to distinguish predictions from alternatives
Reducing user overreliance on incorrect VLM predictions via quality scores
Innovation

Methods, ideas, or system contributions that make the work stand out.

Quality scores measure Visual Fidelity and Contrastiveness
Scores improve user accuracy without visual context
Scores reduce overreliance on incorrect VLM predictions
🔎 Similar Papers
No similar papers found.