Explanation-Driven Counterfactual Testing for Faithfulness in Vision-Language Model Explanations

📅 2025-09-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-Language Models (VLMs) frequently generate Natural Language Explanations (NLEs) that appear superficially plausible yet lack causal credibility, posing both technical and governance risks. To address this, we propose an automated verification framework that treats NLEs as falsifiable hypotheses: it parses causal claims from natural language, synthesizes counterfactual images via generative editing, and evaluates causal consistency using large language model (LLM)-driven reasoning, yielding a quantified faithfulness score. Evaluated on 120 OK-VQA samples across multiple state-of-the-art VLMs, the method uncovers pervasive causal failures and produces auditable, interpretable regulatory evidence. The core contribution is the first end-to-end automation of counterfactual testing for NLEs, bridging model interpretability with causal accountability and enabling rigorous, evidence-based oversight of VLM-generated explanations.
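The summary says causal claims are parsed from the NLE by LLM-driven reasoning, but does not name the LLM or the prompt. The Python sketch below shows one plausible shape of that parsing step; the OpenAI client, the model name, and the prompt wording are all illustrative assumptions, not the paper's setup.

```python
# Sketch of the NLE-parsing step: ask an LLM which visual concepts the
# explanation cites as evidence. Model choice and prompt are assumptions.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def extract_visual_concepts(nle: str) -> list[str]:
    """Return the concrete visual concepts an explanation relies on."""
    prompt = (
        "The following is a model's explanation for a visual question answer.\n"
        f"Explanation: {nle}\n"
        "List the concrete visual concepts (objects, attributes) the explanation "
        "cites as evidence, as a JSON array of short strings. Return only JSON."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)

# e.g. extract_visual_concepts("The person holds an umbrella, so it is raining.")
# might return ["umbrella", "person"]
```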

📝 Abstract
Vision-Language Models (VLMs) often produce fluent Natural Language Explanations (NLEs) that sound convincing but may not reflect the causal factors driving predictions. This gap between plausibility and faithfulness poses technical and governance risks. We introduce Explanation-Driven Counterfactual Testing (EDCT), a fully automated verification procedure for a target VLM that treats the model's own explanation as a falsifiable hypothesis. Given an image-question pair, EDCT: (1) obtains the model's answer and NLE, (2) parses the NLE into testable visual concepts, (3) generates targeted counterfactual edits via generative inpainting, and (4) computes a Counterfactual Consistency Score (CCS) using LLM-assisted analysis of changes in both answers and explanations. Across 120 curated OK-VQA examples and multiple VLMs, EDCT uncovers substantial faithfulness gaps and provides regulator-aligned audit artifacts indicating when cited concepts fail causal tests.
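The abstract enumerates EDCT's four steps but not the exact CCS formula. As a minimal sketch of how the loop might fit together: `model.answer_with_explanation`, `editor.remove_concept`, and `parser.extract_visual_concepts` below are hypothetical interfaces, and the pass/fail rule inside the score is an assumption, not the paper's definition.

```python
from dataclasses import dataclass

@dataclass
class ConceptTest:
    concept: str          # visual concept cited in the NLE
    answer_changed: bool  # did the answer flip after editing the concept out?
    still_cited: bool     # does the new NLE still cite the removed concept?

def counterfactual_consistency_score(tests: list[ConceptTest]) -> float:
    """Aggregate per-concept causal tests into a CCS in [0, 1].

    Assumed rule (the abstract gives no formula): a cited concept passes
    if removing it changes the answer and the new explanation drops it.
    """
    if not tests:
        return 0.0
    passed = sum(t.answer_changed and not t.still_cited for t in tests)
    return passed / len(tests)

def edct_audit(model, editor, parser, image, question):
    """One EDCT pass over a single image-question pair (hypothetical helpers)."""
    answer, nle = model.answer_with_explanation(image, question)  # step 1
    tests = []
    for concept in parser.extract_visual_concepts(nle):           # step 2
        edited = editor.remove_concept(image, concept)            # step 3
        new_answer, new_nle = model.answer_with_explanation(edited, question)
        tests.append(ConceptTest(
            concept=concept,
            answer_changed=new_answer != answer,
            still_cited=concept.lower() in new_nle.lower(),
        ))
    return counterfactual_consistency_score(tests), tests         # step 4
```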
Problem

Research questions and friction points this paper is trying to address.

Testing faithfulness of vision-language model explanations
Automated verification of causal factors in predictions
Identifying gaps between plausible and faithful explanations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated verification procedure for model explanations
Parses explanations into testable visual concepts
Generates counterfactual edits via generative inpainting (a minimal sketch follows this list)
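Neither the abstract nor the summary names the inpainting backbone. The snippet below is a sketch only, using Hugging Face diffusers with Stable Diffusion 2 inpainting as one common choice; the file names are illustrative, and producing the concept mask (e.g. with an open-vocabulary segmenter) is assumed to happen upstream.

```python
# Sketch of the counterfactual-edit step: remove a cited concept from the image.
# Model choice, file names, and mask source are assumptions, not the paper's setup.
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting",
    torch_dtype=torch.float16,
).to("cuda")

image = Image.open("scene.png").convert("RGB")       # original image
mask = Image.open("concept_mask.png").convert("L")   # white = region covering the cited concept

# Fill the masked region without the cited concept, yielding the
# counterfactual image used in the causal test.
counterfactual = pipe(
    prompt="the same scene, background only",
    image=image,
    mask_image=mask,
).images[0]
counterfactual.save("scene_counterfactual.png")
```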
Sihao Ding
Mercedes-Benz Research & Development North America
Computer Vision, Machine Learning

Santosh Vasa
Mercedes-Benz Research & Development North America

Aditi Ramadwar
Mercedes-Benz Research & Development North America