Revealing the Gap in Human and VLM Scene Perception through Counterfactual Semantic Saliency

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing methods struggle to evaluate whether closed-source vision-language models (VLMs) align with human perception in high-level scene understanding, or to identify the causal features underlying such judgments. This work proposes Counterfactual Semantic Saliency (CSS), a black-box, model-agnostic framework that quantifies the causal contribution of individual objects to scene understanding by ablating each object from the scene and measuring the resulting shift in semantic embedding space. The evaluation pairs high-fidelity counterfactual image generation with a large-scale human psychophysics experiment, so CSS can identify discrepancies between VLMs and human perception without access to model internals. The study finds that, relative to humans, VLMs over-rely on large objects (size bias), centrally located objects (center bias), and high-saliency objects, and rely less on people in the scene. A model's size bias is a primary driver of variation in model-human semantic divergence.

📝 Abstract

Evaluating whether large vision-language models (VLMs) align with human perception in high-level semantic scene comprehension remains a challenge. Traditional white-box interpretability methods are inapplicable to closed-source architectures, and passive metrics fail to isolate causal features. We introduce Counterfactual Semantic Saliency (CSS), a black-box, model-agnostic framework that quantifies the importance of objects by measuring the semantic shift induced by their causal ablation from a scene. To evaluate AI-human semantic alignment, we tested prominent VLMs against a human psychophysics baseline comprising 16,289 valid responses across 307 complex natural scenes and 1,306 high-fidelity counterfactual variants. Our analysis reveals a pervasive scene comprehension gap: relative to humans, models over-rely on large objects (size bias), objects at the center of the image (center bias), and high-saliency objects. In contrast, models rely less than our human participants on people in the scenes when describing the images. A model's size bias is a primary driver explaining variations in model-human semantic divergence. Code and data will be available at https://github.com/starsky77/Counterfactual-Semantic-Saliency.
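
The core measurement described in the abstract, scoring an object by how far the scene's semantics move once that object is removed, can be illustrated with a minimal sketch. The abstract does not specify the embedding model or the distance function, so the use of cosine distance, the toy three-dimensional vectors, and the names `cosine_distance` and `counterfactual_saliency` are all assumptions made for illustration, not the authors' implementation.

```python
import math
from typing import Dict, Sequence

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Return 1 - cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def counterfactual_saliency(
    original: Vector, ablated: Dict[str, Vector]
) -> Dict[str, float]:
    """Score each object by the semantic shift its removal induces.

    `original` is the embedding of the description of the full scene.
    `ablated` maps an object name to the embedding of the description
    of the counterfactual image with that object removed.
    Cosine distance is an assumed choice of shift metric.
    """
    return {name: cosine_distance(original, emb) for name, emb in ablated.items()}


# Hypothetical toy embeddings; real ones would come from a text encoder
# applied to model or human descriptions of each image.
original = [1.0, 0.0, 0.0]
ablated = {
    "person": [0.0, 1.0, 0.0],  # description changes completely
    "bench": [1.0, 0.1, 0.0],   # description barely changes
}

scores = counterfactual_saliency(original, ablated)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy case `ranking` places "person" ahead of "bench", because removing the person moves the description much further in embedding space. Computing the same scores from human descriptions and from a VLM's descriptions would give two saliency profiles per scene that can then be compared.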

Problem

Research questions and friction points this paper is trying to address.

scene perception

vision-language models

semantic alignment

human perception

counterfactual analysis

Innovation

Methods, ideas, or system contributions that make the work stand out.

Counterfactual Semantic Saliency

vision-language models

causal ablation