AI Summary
This study investigates whether state-of-the-art vision-language models possess human-like spatial visualization capabilities, focusing on operations such as symmetry transformations and rotations. To this end, we introduce MentalBlackboard, an open benchmark that formalizes spatial visualization as evaluable mathematical transformation tasks and establishes a dual-dimensional evaluation framework encompassing both prediction and planning. Using a newly curated dataset of origami and hole-punching tasks, we conduct end-to-end evaluations of models including Claude Opus 4.1 and o3. Results show that while o3 achieves 71.6% accuracy on the generalization task, its performance drops to only 25% on text-based prediction tasks requiring spatial imagination. Similarly, Claude Opus 4.1 attains at most 10% accuracy on planning tasks, revealing fundamental limitations in current models' capacity for symmetry reasoning and for understanding physical contexts.
Abstract
Spatial visualization is the mental ability to imagine, transform, and manipulate the spatial characteristics of objects and actions. This ability is a part of human cognition in which action and perception are connected at a mental level. To explore whether state-of-the-art Vision-Language Models (VLMs) exhibit this ability, we develop MentalBlackboard, an open-ended spatial visualization benchmark built on Paper Folding and Hole Punching tests with two core tasks: prediction and planning. Our prediction experiments reveal that models struggle to apply symmetrical transformations even when they correctly predict the sequence of unfolding steps. Rotations also pose a significant challenge to models' physical situational awareness. The planning task exposes models' limitations in analyzing symmetrical relationships and in carrying out multi-stage symmetry processes, with Claude Opus 4.1 achieving the highest planning score at an accuracy of 10%. The top-performing model, o3, attains a peak performance of 71.6% on the generalization task, which requires transferring spatial data rather than spatial visualization; however, it achieves only 25% accuracy on text-based prediction tasks.
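To make the Hole Punching setting concrete, the core geometric operation behind such tests can be sketched as follows. This is an illustrative sketch, not the paper's implementation: a sheet is folded along axis-aligned lines, a hole is punched, and unfolding mirrors each hole across every fold line in reverse order. The function names (`reflect`, `unfold`) and the coordinate convention are assumptions for illustration.

```python
# Illustrative sketch (not from the paper): the symmetry transformation
# underlying a hole-punching test. Each undone fold reflects every hole
# across the fold line, so the hole set doubles per fold.

def reflect(point, axis, value):
    """Mirror a 2D point across the line x=value (axis=0) or y=value (axis=1)."""
    x, y = point
    if axis == 0:
        return (2 * value - x, y)
    return (x, 2 * value - y)

def unfold(holes, folds):
    """Undo folds in reverse order; each hole also spawns its mirror image."""
    holes = set(holes)
    for axis, value in reversed(folds):
        holes |= {reflect(p, axis, value) for p in holes}
    return holes

# Example: fold a 2x2 sheet in half along x=1, punch a hole at (0.5, 0.5).
# Unfolding yields the original hole plus its reflection at (1.5, 0.5).
print(sorted(unfold([(0.5, 0.5)], [(0, 1.0)])))
# -> [(0.5, 0.5), (1.5, 0.5)]
```

The benchmark's prediction task asks the model to produce exactly this kind of unfolded hole pattern from a description or image of the folds, which is where the reported symmetry-reasoning failures appear.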