🤖 AI Summary
Prior work lacks a systematic characterization of spatial capability development in multimodal large language models (MLLMs), relying instead on narrow-task evaluations. Method: inspired by cognitive science, we propose a four-layer spatial capability hierarchy (L1 perception, L2 mental mapping, L3 simulation, L4 agent interaction) and introduce the first capability-centered, hierarchical benchmark, systematically evaluating 27 fine-grained sub-capabilities and uncovering their dependency structure. We establish the first spatial capability taxonomy, identify cross-level positive transfer alongside intra-L1 negative transfer, and propose Auto-Think, a prompting strategy that mitigates the perceptual degradation caused by overthinking during reinforcement learning (RL). Results: experiments confirm that L1 capabilities are largely orthogonal while higher-level capabilities are strongly correlated, and that joint improvement across all four layers is achievable. Auto-Think combined with RL delivers stable, significant gains across all layers, outperforming all baselines.
📝 Abstract
Cognitive science suggests that spatial ability develops progressively, from perception to reasoning and interaction. Yet in multimodal LLMs (MLLMs), this hierarchy remains poorly understood, as most studies focus on a narrow set of tasks. We introduce SpatialTree, a cognitive-science-inspired hierarchy that organizes spatial abilities into four levels: low-level perception (L1), mental mapping (L2), simulation (L3), and agentic competence (L4). Based on this taxonomy, we construct the first capability-centric hierarchical benchmark, thoroughly evaluating mainstream MLLMs across 27 sub-abilities. The evaluation results reveal a clear structure: L1 skills are largely orthogonal, whereas higher-level skills are strongly correlated, indicating increasing interdependency. Through targeted supervised fine-tuning, we uncover a surprising transfer dynamic: negative transfer within L1, but strong cross-level transfer from low- to high-level abilities, with notable synergy. Finally, we explore how to improve the entire hierarchy. We find that naive RL that encourages extensive "thinking" is unreliable: it helps complex reasoning but hurts intuitive perception. We propose a simple auto-think strategy that suppresses unnecessary deliberation, enabling RL to consistently improve performance across all levels. By building SpatialTree, we provide a proof-of-concept framework for understanding and systematically scaling spatial abilities in MLLMs.