🤖 AI Summary
Prior work lacks a systematic characterization of spatial capability development in multimodal large language models (MLLMs), relying instead on narrow-task evaluations. Method: inspired by cognitive science, we propose a four-layer spatial capability hierarchy (L1 perception, L2 mental mapping, L3 simulation, L4 agent interaction) and introduce the first capability-centered, hierarchical benchmark, systematically evaluating 27 fine-grained sub-capabilities and uncovering their dependency structure. We establish the first spatial capability taxonomy, identify cross-level positive transfer alongside intra-L1 negative transfer, and propose Auto-Think, a prompting strategy that mitigates the perceptual degradation caused by overthinking during reinforcement learning (RL). Results: experiments confirm that L1 capabilities are largely orthogonal while higher-level capabilities are strongly correlated, and that joint improvement across all four layers is achievable. Auto-Think combined with RL delivers stable, significant gains across all layers, outperforming all baselines.
📝 Abstract
Cognitive science suggests that spatial ability develops progressively, from perception to reasoning and interaction. Yet in multimodal LLMs (MLLMs), this hierarchy remains poorly understood, as most studies focus on a narrow set of tasks. We introduce SpatialTree, a cognitive-science-inspired hierarchy that organizes spatial abilities into four levels: low-level perception (L1), mental mapping (L2), simulation (L3), and agentic competence (L4). Based on this taxonomy, we construct the first capability-centric hierarchical benchmark, thoroughly evaluating mainstream MLLMs across 27 sub-abilities. The evaluation results reveal a clear structure: L1 skills are largely orthogonal, whereas higher-level skills are strongly correlated, indicating increasing interdependency. Through targeted supervised fine-tuning, we uncover a surprising transfer dynamic: negative transfer within L1, but strong cross-level transfer from low- to high-level abilities, with notable synergy. Finally, we explore how to improve the entire hierarchy. We find that naive RL that encourages extensive "thinking" is unreliable: it helps complex reasoning but hurts intuitive perception. We propose a simple auto-think strategy that suppresses unnecessary deliberation, enabling RL to consistently improve performance across all levels. By building SpatialTree, we provide a proof-of-concept framework for understanding and systematically scaling spatial abilities in MLLMs.