I Spy With My Model's Eye: Visual Search as a Behavioural Test for MLLMs

📅 2025-10-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
The visual processing mechanisms of multimodal large language models (MLLMs) remain largely opaque, hindering interpretability and principled evaluation. Method: This work adapts the classic visual search paradigm from cognitive psychology to systematically assess whether MLLMs exhibit human-like perceptual phenomena—specifically, the pop-out effect and sensitivity to natural scene priors (e.g., lighting direction). The authors conduct controlled single- and multi-feature search experiments, systematically varying colour, size, and illumination, and complement the behavioural analysis with targeted fine-tuning and interpretability techniques. Contribution/Results: State-of-the-art MLLMs robustly exhibit the pop-out effect in single-feature searches, show capacity limitations in conjunction searches, and integrate natural lighting priors. This is the first study to transplant human visual search paradigms into MLLM evaluation, establishing a novel methodology and providing empirical evidence about the models' underlying visual processing mechanisms.

📝 Abstract
Multimodal large language models (MLLMs) achieve strong performance on vision-language tasks, yet their visual processing is opaque. Most black-box evaluations measure task accuracy, but reveal little about underlying mechanisms. Drawing on cognitive psychology, we adapt classic visual search paradigms — originally developed to study human perception — to test whether MLLMs exhibit the "pop-out" effect, where salient visual features are detected independently of distractor set size. Using controlled experiments targeting colour, size and lighting features, we find that advanced MLLMs exhibit human-like pop-out effects in colour or size-based disjunctive (single feature) search, as well as capacity limits for conjunctive (multiple feature) search. We also find evidence to suggest that MLLMs, like humans, incorporate natural scene priors such as lighting direction into object representations. We reinforce our findings using targeted fine-tuning and mechanistic interpretability analyses. Our work shows how visual search can serve as a cognitively grounded diagnostic tool for evaluating perceptual capabilities in MLLMs.
Problem

Research questions and friction points this paper is trying to address.

Evaluating visual processing mechanisms in MLLMs
Testing pop-out effects and capacity limits in MLLMs
Developing cognitive diagnostic tools for MLLM perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapts visual search paradigms from cognitive psychology
Tests pop-out effects using colour, size, and lighting features
Uses fine-tuning and interpretability analyses for validation
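To make the paradigm concrete, here is a minimal sketch of how a single-feature (colour) pop-out trial could be specified: one odd-coloured target among uniformly coloured distractors, with set size varied across trials. The pop-out prediction is that detection performance stays flat as set size grows. All names here (`make_popout_trial`, the colour values, the grid layout) are illustrative assumptions, not the paper's actual stimulus code.

```python
import random

def make_popout_trial(set_size, target_colour="red",
                      distractor_colour="green", grid=6, seed=None):
    """Specify a single-feature (colour) pop-out search display:
    one target among identically coloured distractors on a grid."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(grid) for c in range(grid)]
    # Place set_size items at distinct grid positions.
    items = [{"pos": p, "colour": distractor_colour, "is_target": False}
             for p in rng.sample(cells, set_size)]
    # Promote one random item to the odd-coloured target.
    target = rng.choice(items)
    target["colour"] = target_colour
    target["is_target"] = True
    return items

# Pop-out prediction: one salient target regardless of set size,
# so detection should not degrade as distractors are added.
for n in (4, 8, 16):
    trial = make_popout_trial(n, seed=0)
    assert len(trial) == n
    assert sum(it["is_target"] for it in trial) == 1
```

A conjunction (multiple-feature) variant would instead mix distractors that each share one feature with the target (e.g. same colour but different size), which is where the paper reports human-like capacity limits.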
John Burden
University of Cambridge
Reinforcement Learning, Artificial Intelligence, Long-term AI Safety, AI Evaluation
Jonathan Prunty
Leverhulme Centre for the Future of Intelligence, University of Cambridge
Ben Slater
Leverhulme Centre for the Future of Intelligence, University of Cambridge
Matthieu Tehenan
Department of Computer Science, University of Cambridge
Greg Davis
Department of Psychology, University of Cambridge
Lucy Cheke
Professor of Experimental Psychology, Department of Psychology, Cambridge
Episodic Memory, Memory Development, Memory Impairment, Comparative Cognition, Cognition in AI