🤖 AI Summary
State-of-the-art large language models exhibit significant limitations in spatial reasoning tasks that require mental simulation, such as mental rotation, revealing a fundamental deficiency in visuospatial representation. This work proposes a dual-module architecture that integrates a multimodal large language model with a programmable 3D rendering engine, which serves as an external "imagery module," to investigate whether offloading visual state maintenance can compensate for these spatial reasoning shortcomings. Experimental results show that even with precise 3D rotation and rendering support, model performance peaks at 62.5% accuracy, exposing inherent limitations in depth perception, motion understanding, short-horizon dynamic prediction, and reflective image-based reasoning. This study provides the first systematic evidence that current models struggle to leverage external visual imagery effectively, underscoring the need to incorporate low-level visuospatial primitives into their architectures.
📝 Abstract
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, yet they struggle with spatial tasks that require mental simulation, such as mental rotation. This paper investigates whether equipping an LLM with an external "Imagery Module" (a tool capable of rendering and rotating 3D models) can bridge this gap, functioning as a "cognitive prosthetic." We conducted experiments using a dual-module architecture in which a reasoning module (an MLLM) interacts with an imagery module on 3D model rotation tasks. Performance was lower than expected, with accuracy reaching at most 62.5%. Further investigation suggests that even when the burden of maintaining and manipulating a holistic 3D state is outsourced, the system still fails. This reveals that current frontier models lack the foundational visual-spatial primitives required to interface with imagery. Specifically, they lack: (1) the low-level sensitivity to extract spatial signals such as (a) depth, (b) motion, and (c) short-horizon dynamic prediction; and (2) the capacity to reason contemplatively over images, dynamically shifting visual focus and balancing imagery with symbolic and associative information.
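The dual-module interaction described above can be sketched as a simple tool-use loop. This is a hypothetical illustration, not the paper's implementation: `ImageryModule` stands in for the programmable 3D renderer (here it only tracks cumulative Euler angles instead of producing images), and `reasoning_loop` stands in for the MLLM issuing rotate/render calls.

```python
from dataclasses import dataclass, field

@dataclass
class ImageryModule:
    """Toy stand-in for the external 3D imagery tool.

    A real module would hold a 3D model and return rendered images;
    this sketch only tracks orientation as Euler angles (degrees).
    """
    angles: list = field(default_factory=lambda: [0.0, 0.0, 0.0])

    def rotate(self, axis: int, degrees: float) -> None:
        # Apply a rotation about one axis (0=x, 1=y, 2=z).
        self.angles[axis] = (self.angles[axis] + degrees) % 360

    def render(self) -> str:
        # A real module would return an image; we return a text "view".
        x, y, z = self.angles
        return f"view(x={x}, y={y}, z={z})"

def reasoning_loop(target_angles, max_steps=10):
    """Stand-in for the reasoning module (the MLLM): it repeatedly
    inspects the current view and issues rotation commands until the
    rendered orientation matches the target."""
    imagery = ImageryModule()
    for _ in range(max_steps):
        if imagery.angles == list(target_angles):
            break
        for axis in range(3):
            delta = (target_angles[axis] - imagery.angles[axis]) % 360
            if delta:
                imagery.rotate(axis, delta)
    return imagery.render()
```

In the actual experiments the "inspect" step is where the model must read spatial signals (depth, motion) out of rendered images, and the paper's finding is that this perceptual step, not the state maintenance, is what fails.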