Revealing the Gap in Human and VLM Scene Perception through Counterfactual Semantic Saliency

📅 2026-05-13

🤖 AI Summary
Existing methods struggle to evaluate how well closed-source vision-language models (VLMs) align with human perception in high-level scene understanding, and to identify the causal features underlying such judgments. This work proposes Counterfactual Semantic Saliency (CSS), a black-box, model-agnostic framework that quantifies the causal contribution of individual objects to scene understanding by ablating scene elements and measuring the resulting shift in semantic embedding space. The evaluation combines high-fidelity counterfactual image generation with large-scale human psychophysical experiments. CSS enables systematic identification of discrepancies between VLMs and human perception without requiring internal model access. The study finds that VLMs over-rely, relative to humans, on large, central, and highly salient objects, and rely less on people in the scene than human participants do. It identifies a model's size bias as a primary driver of model-human semantic divergence.
📝 Abstract
Evaluating whether large vision-language models (VLMs) align with human perception for high-level semantic scene comprehension remains a challenge. Traditional white-box interpretability methods are inapplicable to closed-source architectures and passive metrics fail to isolate causal features. We introduce Counterfactual Semantic Saliency (CSS). This black-box, model-agnostic framework quantifies the importance of objects by measuring the semantic shift induced by their causal ablation from a scene. To evaluate AI-human semantic alignment, we tested prominent VLMs against a human psychophysics baseline comprising 16,289 valid responses across 307 complex natural scenes and 1,306 high-fidelity counterfactual variants. Our analysis reveals a pervasive scene comprehension gap: models exhibit an overreliance (relative to humans) on large objects (size bias), objects at the center of the image (center bias), and high saliency objects. In contrast, models rely less on people in the scenes than our human participants to describe the images. A model's size bias is a primary driver explaining variations in model-human semantic divergence. Code and data will be available at https://github.com/starsky77/Counterfactual-Semantic-Saliency.
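The core idea in the abstract, scoring an object by the semantic shift its removal causes, can be illustrated with a minimal sketch. It assumes that the description of the original scene and of each counterfactual variant has already been embedded as a vector, and it uses cosine distance as the shift measure. The paper's actual metric and pipeline are not specified here, and the names `cosine_distance` and `css_scores` are hypothetical.

```python
import math
from typing import Dict, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Return 1 - cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return 1.0 - dot / (norm_u * norm_v)


def css_scores(
    original: Sequence[float],
    ablated: Dict[str, Sequence[float]],
) -> Dict[str, float]:
    """Score each object by the semantic shift caused by removing it.

    `original` is the embedding of the description of the unmodified scene.
    `ablated` maps an object name to the embedding of the description of the
    counterfactual image with that object removed.
    Cosine distance is an assumed shift measure, not necessarily the paper's.
    """
    return {name: cosine_distance(original, emb) for name, emb in ablated.items()}


if __name__ == "__main__":
    # Toy 2-D embeddings, invented for illustration only.
    scene = [1.0, 0.0]
    variants = {
        "person": [0.0, 1.0],  # description changes completely without the person
        "cup": [1.0, 0.0],     # description is unchanged without the cup
    }
    for obj, score in css_scores(scene, variants).items():
        print(f"{obj}: {score:.2f}")
```

Under this sketch, the same function would be applied once to a VLM's descriptions and once to human descriptions, so the two sets of per-object scores can be compared against object properties such as size or position.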
Problem

Research questions and friction points this paper is trying to address.

scene perception
vision-language models
semantic alignment
human perception
counterfactual analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Counterfactual Semantic Saliency
vision-language models
causal ablation
semantic alignment
human psychophysics