🤖 AI Summary
Existing multimodal large language models (MLLMs) are evaluated primarily on passive visual perception tasks, with little systematic assessment of mental visualization: the active construction of internal visual representations to support spatial reasoning, dynamic simulation, and abstract inference. This work introduces Hyperphantasia, a synthetic benchmark designed specifically to evaluate mental visualization. It comprises four procedurally generated visual puzzles spanning abstract reasoning, spatial imagination, and scene simulation, each presented at multiple difficulty levels to enable controlled difficulty scaling and interpretable evaluation. The methodology combines procedural content generation, human versus model comparison, and reinforcement learning aimed at improving visual simulation. Empirical results show that state-of-the-art MLLMs fall well short of human performance on Hyperphantasia, exposing fundamental limitations in their capacity for active visual construction. The work addresses a gap in multimodal cognitive evaluation and offers both a new evaluation paradigm and concrete directions for improving visual mental representation in foundation models.
📝 Abstract
Mental visualization, the ability to construct and manipulate visual representations internally, is a core component of human cognition and plays a vital role in tasks involving reasoning, prediction, and abstraction. Despite the rapid progress of Multimodal Large Language Models (MLLMs), current benchmarks primarily assess passive visual perception, offering limited insight into the more active capability of internally constructing visual patterns to support problem solving. Yet mental visualization is a critical cognitive skill in humans, supporting abilities such as navigating space, predicting physical trajectories, and solving complex visual problems through imaginative simulation. To bridge this gap, we introduce Hyperphantasia, a synthetic benchmark designed to evaluate the mental visualization abilities of MLLMs through four carefully constructed puzzles. Each task is procedurally generated and presented at three difficulty levels, enabling controlled analysis of model performance across increasing complexity. Our comprehensive evaluation of state-of-the-art models reveals a substantial gap between the performance of humans and MLLMs. Additionally, we explore the potential of reinforcement learning to improve visual simulation capabilities. Our findings suggest that while some models exhibit partial competence in recognizing visual patterns, robust mental visualization remains an open challenge for current MLLMs.
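The abstract describes tasks that are procedurally generated and presented at three difficulty levels. As a rough illustration of what such controlled generation can look like, the sketch below implements a hypothetical connect-the-points puzzle generator; the puzzle type, the difficulty presets, and the parameter names (`grid_size`, `num_points`) are assumptions made for this example and are not taken from the paper.

```python
import random
from dataclasses import dataclass

# Hypothetical difficulty presets. The benchmark's real parameters are not
# given in the abstract; these names and values are illustrative only.
DIFFICULTY = {
    "easy":   {"grid_size": 5,  "num_points": 4},
    "medium": {"grid_size": 8,  "num_points": 7},
    "hard":   {"grid_size": 12, "num_points": 10},
}

@dataclass
class Puzzle:
    level: str
    points: list   # labeled 2D points the solver must mentally connect in order
    answer: int    # ground truth: total Manhattan length of the traced path

def generate_puzzle(level: str, seed: int) -> Puzzle:
    """Procedurally generate one connect-the-points style instance."""
    cfg = DIFFICULTY[level]
    rng = random.Random(seed)  # per-instance seed keeps generation reproducible
    cells = [(x, y) for x in range(cfg["grid_size"]) for y in range(cfg["grid_size"])]
    points = rng.sample(cells, cfg["num_points"])
    # The verifiable answer: path length when visiting points in label order.
    answer = sum(abs(a[0] - b[0]) + abs(a[1] - b[1])
                 for a, b in zip(points, points[1:]))
    return Puzzle(level=level, points=points, answer=answer)

if __name__ == "__main__":
    for level in DIFFICULTY:
        p = generate_puzzle(level, seed=0)
        print(level, p.points, "->", p.answer)
```

Scaling knobs like `grid_size` and `num_points` are what make controlled analysis across increasing complexity possible: every instance has a deterministic, automatically checkable answer, while its difficulty can be dialed up without changing the task definition.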