🤖 AI Summary
This study investigates whether vision-language models (VLMs) exhibit human-like cognitive control—specifically, goal prioritization and interference suppression under conflict. Grounded in classic paradigms (Stroop, Flanker, Simon), we introduce a psychophysical evaluation protocol tailored to multimodal foundation models and apply it to 108 models across 2,220 trials, including high-difficulty variants. Methodologically, we adapt human executive-function assessment protocols to VLMs, combining quantitative behavioral analysis with cross-model comparison. Results show that, under resource constraints, VLMs display human-like executive-function patterns alongside substantial inter-model variability; state-of-the-art models effectively suppress distractors and amplify target responses, exhibiting behavioral signatures closely consistent with human performance. This work establishes a novel, empirically grounded paradigm for modeling VLM cognition and evaluating controllability.
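As a concrete illustration of the paradigm (the paper's actual stimuli and prompts are not specified here, so everything below is an assumption), a Stroop-style trial can be rendered as an image whose word identity and ink color either agree (congruent) or conflict (incongruent); the model is then asked to report the ink color while ignoring the word. A minimal sketch using Pillow, with hypothetical colors and prompt wording:

```python
# Illustrative sketch only: the exact stimuli and prompts used in the
# study are assumptions here. A Stroop trial renders a color word in an
# ink color that either matches (congruent) or conflicts with
# (incongruent) the word itself.
from PIL import Image, ImageDraw

COLORS = {"red": (220, 40, 40), "green": (40, 160, 60), "blue": (40, 80, 220)}

def stroop_stimulus(word: str, ink: str, size=(320, 160)) -> Image.Image:
    """Render `word` in the RGB value of `ink` on a white background."""
    img = Image.new("RGB", size, "white")
    ImageDraw.Draw(img).text((60, 60), word.upper(), fill=COLORS[ink])
    return img

congruent = stroop_stimulus("red", "red")     # word and ink agree
incongruent = stroop_stimulus("red", "blue")  # word conflicts with ink
prompt = "What is the ink color of the word? Answer with one word."
```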
📝 Abstract
Cognitive control refers to the ability to flexibly coordinate thought and action in pursuit of internal goals. A standard method for assessing cognitive control involves conflict tasks that contrast congruent and incongruent trials, measuring the ability to prioritize relevant information while suppressing interference. We evaluate 108 vision-language models on three classic conflict tasks and their more demanding "squared" variants across 2,220 trials. Model performance corresponds closely to human behavior under resource constraints and reveals individual differences across models. These results indicate that some form of human-like executive function has emerged in current multimodal foundation models.
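The key behavioral measure in such conflict tasks is the congruency effect: the performance drop from congruent to incongruent trials, which indexes how well interference is suppressed. A minimal scoring sketch, assuming hypothetical per-trial records with a condition label and a correctness flag:

```python
# Minimal sketch, assuming (hypothetical) per-trial records of the form
# {"condition": "congruent" | "incongruent", "correct": bool}.
# The congruency effect is the accuracy lost on incongruent trials;
# a small effect with high overall accuracy indicates effective
# distractor suppression.
from statistics import mean

def congruency_effect(trials: list[dict]) -> float:
    # Accuracy per condition, then congruent minus incongruent.
    acc = {
        cond: mean(t["correct"] for t in trials if t["condition"] == cond)
        for cond in ("congruent", "incongruent")
    }
    return acc["congruent"] - acc["incongruent"]

trials = [
    {"condition": "congruent", "correct": True},
    {"condition": "congruent", "correct": True},
    {"condition": "incongruent", "correct": True},
    {"condition": "incongruent", "correct": False},
]
print(congruency_effect(trials))  # 0.5, i.e. a 50-percentage-point drop under conflict
```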