Mind's Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs

📅 2026-04-17

🤖 AI Summary
This work addresses the pronounced deficiencies of multimodal large language models in visual cognition and spatial reasoning by introducing the first multiple-choice benchmark grounded in the “Abstraction–Relation–Transformation” (A-R-T) taxonomy. Inspired by human intelligence tests and informed by cognitive science, the benchmark comprises eight standardized tasks that systematically evaluate three core facets of fluid intelligence: visual abstraction, relational reasoning, and mental transformation. Human control experiments and error attribution analyses show that humans achieve an average accuracy of 80%, while even state-of-the-art models remain below 50%, exposing fundamental limitations in visual attention, internal representational manipulation, and conceptual abstraction.

📝 Abstract
Multimodal large language models (MLLMs) have achieved impressive progress on vision-language benchmarks, yet their capacity for visual-cognitive and visuospatial reasoning remains less well understood. We introduce "Mind's Eye", a multiple-choice benchmark of eight visuo-cognitive tasks inspired by classic human intelligence tests and organized under a novel "A-R-T" taxonomy: Abstraction, Relation, and Transformation. The tasks probe core processes of fluid intelligence such as pattern induction, analogical relation mapping, and mental transformation. We evaluate a diverse suite of closed-source and open-source MLLMs and compare their performance with that of human participants. Humans achieve 80% accuracy, while top-performing MLLMs remain below 50%. Error analysis reveals failures in (i) visual attention allocation, (ii) internal perceptual manipulation, and (iii) abstraction of underlying visual concepts. Our findings suggest that current MLLMs exhibit limited visuospatial reasoning capabilities compared with human participants, highlighting the need for more cognitively grounded evaluation frameworks.
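
The abstract reports accuracy on multiple-choice items grouped under the A-R-T taxonomy. As a rough illustration of how such scores could be tallied, here is a minimal Python sketch. The record format, the function names, and the toy items are all assumptions made for this example; they are not the benchmark's data or the authors' evaluation code.

```python
from collections import defaultdict

# Hypothetical record format: (category, model_choice, correct_choice).
# Category names follow the A-R-T taxonomy from the abstract; the items
# below are made-up placeholders, not benchmark data or reported results.


def accuracy_by_category(records):
    """Return {category: fraction of items answered correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, predicted, answer in records:
        total[category] += 1
        if predicted == answer:
            correct[category] += 1
    return {category: correct[category] / total[category] for category in total}


def overall_accuracy(records):
    """Micro-averaged accuracy over all items, regardless of category."""
    if not records:
        return 0.0
    hits = sum(1 for _, predicted, answer in records if predicted == answer)
    return hits / len(records)


toy_records = [
    ("Abstraction", "A", "A"),
    ("Abstraction", "B", "C"),
    ("Relation", "D", "D"),
    ("Transformation", "B", "A"),
]
per_category = accuracy_by_category(toy_records)
overall = overall_accuracy(toy_records)
```

Reporting both per-category and overall accuracy makes it possible to see whether a model's errors concentrate in one facet (for example, Transformation) rather than being spread evenly, which is the kind of breakdown an error analysis like the one described above would rely on.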
Problem

Research questions and friction points this paper is trying to address.

visuospatial reasoning
visual abstraction
multimodal LLMs
fluid intelligence
cognitive evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

visual abstraction
visuospatial reasoning
multimodal LLMs
cognitive benchmark
A-R-T taxonomy