I Spy With My Model's Eye: Visual Search as a Behavioural Test for MLLMs

📅 2025-10-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
The visual processing mechanisms of multimodal large language models (MLLMs) remain largely opaque, hindering interpretability and principled evaluation. Method: This work adapts the classic visual search paradigm from cognitive psychology to systematically assess whether MLLMs exhibit human-like perceptual phenomena—specifically, the pop-out effect and sensitivity to natural scene priors (e.g., lighting direction). The authors conduct controlled single- and multi-feature search experiments, systematically varying colour, size, and illumination, and complement the behavioural analysis with targeted fine-tuning and interpretability techniques. Contribution/Results: State-of-the-art MLLMs robustly exhibit the pop-out effect in single-feature searches, show capacity limitations in conjunction searches, and integrate natural lighting priors. This is the first study to transplant human visual search paradigms into MLLM evaluation, establishing a novel methodology and providing empirical evidence about the models' underlying visual processing mechanisms.

📝 Abstract
Multimodal large language models (MLLMs) achieve strong performance on vision-language tasks, yet their visual processing is opaque. Most black-box evaluations measure task accuracy, but reveal little about underlying mechanisms. Drawing on cognitive psychology, we adapt classic visual search paradigms — originally developed to study human perception — to test whether MLLMs exhibit the "pop-out" effect, where salient visual features are detected independently of distractor set size. Using controlled experiments targeting colour, size and lighting features, we find that advanced MLLMs exhibit human-like pop-out effects in colour or size-based disjunctive (single feature) search, as well as capacity limits for conjunctive (multiple feature) search. We also find evidence to suggest that MLLMs, like humans, incorporate natural scene priors such as lighting direction into object representations. We reinforce our findings using targeted fine-tuning and mechanistic interpretability analyses. Our work shows how visual search can serve as a cognitively grounded diagnostic tool for evaluating perceptual capabilities in MLLMs.
Problem

Research questions and friction points this paper is trying to address.

Evaluating visual processing mechanisms in MLLMs
Testing pop-out effects and capacity limits in MLLMs
Developing cognitive diagnostic tools for MLLM perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapts visual search paradigms from cognitive psychology
Tests pop-out effects using colour, size, and lighting features
Uses fine-tuning and interpretability analyses for validation
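To make the paradigm concrete, here is a minimal sketch of how a single-feature (colour) pop-out trial could be specified: one odd-coloured target among uniformly coloured distractors, with set size varied across trials. The pop-out prediction is that detection performance stays flat as set size grows. All names here (`make_popout_trial`, the colour values, the grid layout) are illustrative assumptions, not the paper's actual stimulus code.

```python
import random

def make_popout_trial(set_size, target_colour="red",
                      distractor_colour="green", grid=6, seed=None):
    """Specify a single-feature (colour) pop-out search display:
    one target among identically coloured distractors on a grid."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(grid) for c in range(grid)]
    # Place set_size items at distinct grid positions.
    items = [{"pos": p, "colour": distractor_colour, "is_target": False}
             for p in rng.sample(cells, set_size)]
    # Promote one random item to the odd-coloured target.
    target = rng.choice(items)
    target["colour"] = target_colour
    target["is_target"] = True
    return items

# Pop-out prediction: one salient target regardless of set size,
# so detection should not degrade as distractors are added.
for n in (4, 8, 16):
    trial = make_popout_trial(n, seed=0)
    assert len(trial) == n
    assert sum(it["is_target"] for it in trial) == 1
```

A conjunction (multiple-feature) variant would instead mix distractors that each share one feature with the target (e.g. same colour but different size), which is where the paper reports human-like capacity limits.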
John Burden
University of Cambridge
Reinforcement Learning, Artificial Intelligence, Long-term AI Safety, AI Evaluation
Jonathan Prunty
Leverhulme Centre for the Future of Intelligence, University of Cambridge
Ben Slater
Leverhulme Centre for the Future of Intelligence, University of Cambridge
Matthieu Tehenan
Department of Computer Science, University of Cambridge
Greg Davis
Department of Psychology, University of Cambridge
Lucy Cheke
Professor of Experimental Psychology, Department of Psychology, Cambridge
Episodic Memory, Memory Development, Memory Impairment, Comparative Cognition, Cognition in AI