🤖 AI Summary
Despite strong performance on standard benchmarks, vision-language models (VLMs) underperform humans on elementary visual reasoning tasks. This work identifies the core bottleneck as a lack of human-like sequential visual processing capability.
Method: Using human reaction time as a proxy for serial processing load, we compare human and VLM performance on controlled tasks in three visual reasoning domains (geometric reasoning, perceptual enumeration, and mental rotation), systematically varying the serial processing demands within each domain.
Results: VLM accuracy declines as serial processing load increases, and this decline correlates strongly with longer human reaction times; the human–VLM performance gap widens reliably as tasks demand more serial processing. Viewed through the lens of cognitive load, the results characterize a fundamental divergence between VLM and human visual reasoning and indicate that current models are limited in incremental, stepwise visual interpretation. The findings offer empirical support for developing models with human-inspired serial visual processing capabilities.
📝 Abstract
Why do Vision Language Models (VLMs), despite success on standard benchmarks, often fail to match human performance on surprisingly simple visual reasoning tasks? While the underlying computational principles are still debated, we hypothesize that a crucial factor is a deficit in visually grounded serial processing. To test this hypothesis, we compared human and VLM performance across tasks designed to vary serial processing demands in three distinct domains: geometric reasoning, perceptual enumeration, and mental rotation. Tasks within each domain varied serial processing load by manipulating factors such as geometric concept complexity, perceptual individuation load, and transformation difficulty. Across all domains, our results revealed a consistent pattern: decreased VLM accuracy was strongly correlated with increased human reaction time (used as a proxy for serial processing load). As tasks require more demanding serial processing (whether composing concepts, enumerating items, or performing mental transformations), the VLM-human performance gap widens reliably. These findings support our hypothesis, indicating that limitations in serial, visually grounded reasoning represent a fundamental bottleneck that distinguishes current VLMs from humans.
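The relationship the abstract reports, lower VLM accuracy going with longer human reaction times, can be illustrated with a small correlation sketch. This is not the authors' analysis code: the abstract does not say which correlation statistic was used, so Pearson's r is an assumption here, the function name `pearson_r` is hypothetical, and the numbers are made-up placeholders rather than results from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical placeholder values, NOT data from the paper.
# Each position is one task condition, ordered by increasing serial load.
human_rt_seconds = [0.8, 1.1, 1.6, 2.3, 3.0]    # mean human reaction time
vlm_accuracy = [0.95, 0.90, 0.72, 0.55, 0.40]   # mean VLM accuracy

r = pearson_r(human_rt_seconds, vlm_accuracy)
print(f"r = {r:.2f}")  # negative r: accuracy falls as reaction time rises
```

A strongly negative r on per-condition means would match the pattern the abstract describes; a rank-based statistic such as Spearman's would be a reasonable alternative if the relationship were monotonic but not linear.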