🤖 AI Summary
Prior work lacks a systematic evaluation of how well multimodal vision-language models (VLMs) align with human low-level visual perception, specifically the contrast sensitivity function (CSF), a foundational psychophysical measure of contrast sensitivity as a function of spatial frequency.
Method: We introduce a psychophysics-inspired behavioral probing paradigm using bandpass-filtered noise stimuli, systematically testing pattern recognition performance of state-of-the-art VLMs across diverse spatial frequencies and contrast levels. We employ varied prompt formulations and cross-architectural response analysis to assess perceptual stability and biological plausibility.
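The stimulus-generation step can be sketched as follows. This is an illustrative reconstruction rather than the authors' released code: the log-Gaussian filter shape, image size, octave bandwidth, and RMS-contrast definition are all assumptions made for the example.

```python
# Minimal sketch (not the authors' code): a bandpass-filtered noise stimulus at a
# target spatial frequency and RMS contrast. Filter shape and contrast definition
# are assumptions for illustration.
import numpy as np

def bandpass_noise(size=256, peak_freq=8.0, bandwidth_oct=1.0, rms_contrast=0.1, seed=0):
    """Return a size x size image in [0, 1] with noise energy concentrated
    around `peak_freq` cycles/image and the requested RMS contrast."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((size, size))

    # Radial frequency grid in cycles per image.
    fy = np.fft.fftfreq(size) * size
    fx = np.fft.fftfreq(size) * size
    radius = np.sqrt(fx[None, :] ** 2 + fy[:, None] ** 2)
    radius[0, 0] = 1e-6  # avoid log(0) at the DC component

    # Log-Gaussian band-pass filter centered on the target frequency.
    sigma = bandwidth_oct / 2.355  # convert FWHM in octaves to sigma
    band = np.exp(-((np.log2(radius) - np.log2(peak_freq)) ** 2) / (2 * sigma ** 2))
    band[0, 0] = 0.0  # remove DC

    filtered = np.real(np.fft.ifft2(np.fft.fft2(noise) * band))

    # Scale to the requested RMS contrast around a mid-gray background of 0.5.
    filtered = filtered / (filtered.std() + 1e-12) * rms_contrast * 0.5
    return np.clip(0.5 + filtered, 0.0, 1.0)

# Example sweep: one stimulus per (spatial frequency, contrast) condition.
stimuli = {(f, c): bandpass_noise(peak_freq=f, rms_contrast=c)
           for f in (2, 4, 8, 16, 32) for c in (0.5, 0.1, 0.02, 0.005)}
```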
Results: No current VLM reproduces both the shape and the amplitude of the human CSF. Moreover, model responses exhibit strong sensitivity to prompt wording, revealing unstable, non-biological low-level visual processing. This work establishes the first benchmark for evaluating VLMs on human-aligned low-level vision, exposing critical gaps and providing concrete directions for developing perceptually grounded multimodal models.
📝 Abstract
Assessing the alignment of multimodal vision-language models (VLMs) with human perception is essential to understand how they perceive low-level visual features. A key characteristic of human vision is the contrast sensitivity function (CSF), which describes sensitivity to low-contrast patterns as a function of spatial frequency. Here, we introduce a novel behavioral, psychophysics-inspired method to estimate the CSF of chat-based VLMs by directly prompting them to judge pattern visibility at different contrasts for each frequency. This methodology is closer to real psychophysical experiments than previously reported approaches. Using bandpass-filtered noise images and a diverse set of prompts, we assess model responses across multiple architectures. We find that while some models approximate the human CSF in shape or magnitude, none fully replicate both. Notably, prompt phrasing has a large effect on the responses, raising concerns about prompt stability. Our results provide a new framework for probing visual sensitivity in multimodal models and reveal key gaps between their visual representations and human perception.
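The visibility-prompting protocol can likewise be sketched. Here `query_vlm`, the prompt wording, and the simple descending-contrast threshold rule are hypothetical stand-ins; the paper's actual models, prompts, and threshold procedure may differ.

```python
# Minimal sketch of the prompting-based CSF estimate described above.
# `query_vlm` is a hypothetical placeholder for a chat-based VLM API call, and the
# descending-contrast rule is a simplification of a psychophysical staircase.
import numpy as np

def query_vlm(image: np.ndarray, prompt: str) -> bool:
    """Placeholder: send the image and prompt to a VLM and parse a yes/no answer."""
    raise NotImplementedError

def estimate_csf(frequencies, contrasts,
                 prompt="Do you see a visible pattern in this image? Answer yes or no."):
    """For each spatial frequency, sweep contrasts from high to low and record the
    reciprocal of the lowest contrast still judged visible (sensitivity = 1 / threshold)."""
    csf = {}
    for f in frequencies:
        threshold = None
        for c in sorted(contrasts, reverse=True):  # high -> low contrast
            # Stimulus generator from the sketch shown earlier in this page.
            image = bandpass_noise(peak_freq=f, rms_contrast=c)
            if query_vlm(image, prompt):
                threshold = c          # still reported visible at this contrast
            else:
                break                  # first "not visible" response ends the sweep
        csf[f] = (1.0 / threshold) if threshold else 0.0
    return csf
```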