🤖 AI Summary
Prior work lacks a systematic evaluation of how well multimodal vision-language models (VLMs) align with human low-level visual perception, specifically the contrast sensitivity function (CSF), a foundational psychophysical measure of contrast sensitivity as a function of spatial frequency.
Method: We introduce a psychophysics-inspired behavioral probing paradigm using bandpass-filtered noise stimuli, systematically testing pattern recognition performance of state-of-the-art VLMs across diverse spatial frequencies and contrast levels. We employ varied prompt formulations and cross-architectural response analysis to assess perceptual stability and biological plausibility.
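The stimulus-generation step can be sketched as follows. This is an illustrative reconstruction rather than the authors' released code: the log-Gaussian filter shape, image size, octave bandwidth, and RMS-contrast definition are all assumptions made for the example.

```python
# Minimal sketch (not the authors' code): a bandpass-filtered noise stimulus at a
# target spatial frequency and RMS contrast. Filter shape and contrast definition
# are assumptions for illustration.
import numpy as np

def bandpass_noise(size=256, peak_freq=8.0, bandwidth_oct=1.0, rms_contrast=0.1, seed=0):
    """Return a size x size image in [0, 1] with noise energy concentrated
    around `peak_freq` cycles/image and the requested RMS contrast."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((size, size))

    # Radial frequency grid in cycles per image.
    fy = np.fft.fftfreq(size) * size
    fx = np.fft.fftfreq(size) * size
    radius = np.sqrt(fx[None, :] ** 2 + fy[:, None] ** 2)
    radius[0, 0] = 1e-6  # avoid log(0) at the DC component

    # Log-Gaussian band-pass filter centered on the target frequency.
    sigma = bandwidth_oct / 2.355  # convert FWHM in octaves to sigma
    band = np.exp(-((np.log2(radius) - np.log2(peak_freq)) ** 2) / (2 * sigma ** 2))
    band[0, 0] = 0.0  # remove DC

    filtered = np.real(np.fft.ifft2(np.fft.fft2(noise) * band))

    # Scale to the requested RMS contrast around a mid-gray background of 0.5.
    filtered = filtered / (filtered.std() + 1e-12) * rms_contrast * 0.5
    return np.clip(0.5 + filtered, 0.0, 1.0)

# Example sweep: one stimulus per (spatial frequency, contrast) condition.
stimuli = {(f, c): bandpass_noise(peak_freq=f, rms_contrast=c)
           for f in (2, 4, 8, 16, 32) for c in (0.5, 0.1, 0.02, 0.005)}
```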
Results: No current VLM reproduces both the shape and the amplitude of the human CSF. Moreover, model responses exhibit strong sensitivity to prompt wording, revealing unstable, non-biological low-level visual processing. This work establishes the first benchmark for evaluating VLMs on human-aligned low-level vision, exposing critical gaps and providing concrete directions for developing perceptually grounded multimodal models.
📝 Abstract
Assessing the alignment of multimodal vision-language models (VLMs) with human perception is essential to understand how they perceive low-level visual features. A key characteristic of human vision is the contrast sensitivity function (CSF), which describes sensitivity to low-contrast patterns as a function of spatial frequency. Here, we introduce a novel behavioral, psychophysics-inspired method to estimate the CSF of chat-based VLMs by directly prompting them to judge pattern visibility at different contrasts for each frequency. This methodology is closer to real psychophysical experiments than previously reported approaches. Using bandpass-filtered noise images and a diverse set of prompts, we assess model responses across multiple architectures. We find that while some models approximate the human CSF in shape or magnitude, none fully replicate both. Notably, prompt phrasing has a large effect on the responses, raising concerns about prompt stability. Our results provide a new framework for probing visual sensitivity in multimodal models and reveal key gaps between their visual representations and human perception.
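The visibility-prompting protocol can likewise be sketched. Here `query_vlm`, the prompt wording, and the simple descending-contrast threshold rule are hypothetical stand-ins; the paper's actual models, prompts, and threshold procedure may differ.

```python
# Minimal sketch of the prompting-based CSF estimate described above.
# `query_vlm` is a hypothetical placeholder for a chat-based VLM API call, and the
# descending-contrast rule is a simplification of a psychophysical staircase.
import numpy as np

def query_vlm(image: np.ndarray, prompt: str) -> bool:
    """Placeholder: send the image and prompt to a VLM and parse a yes/no answer."""
    raise NotImplementedError

def estimate_csf(frequencies, contrasts,
                 prompt="Do you see a visible pattern in this image? Answer yes or no."):
    """For each spatial frequency, sweep contrasts from high to low and record the
    reciprocal of the lowest contrast still judged visible (sensitivity = 1 / threshold)."""
    csf = {}
    for f in frequencies:
        threshold = None
        for c in sorted(contrasts, reverse=True):  # high -> low contrast
            # Stimulus generator from the sketch shown earlier in this page.
            image = bandpass_noise(peak_freq=f, rms_contrast=c)
            if query_vlm(image, prompt):
                threshold = c          # still reported visible at this contrast
            else:
                break                  # first "not visible" response ends the sweep
        csf[f] = (1.0 / threshold) if threshold else 0.0
    return csf
```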