🤖 AI Summary
Current multimodal large language models (MLLMs) face significant bottlenecks in 3D spatial understanding, relying heavily on explicit 3D inputs, custom architectures, or large-scale annotated datasets. To address this, we propose SpatialThinker, a framework that pairs a high-quality spatial visual question answering dataset, STVQA-7K, with scene-graph modeling and multi-step reasoning. Crucially, it introduces a multi-objective dense-reward online reinforcement learning paradigm tailored to spatial relational reasoning, unlocking the implicit 3D representational capacity of vision-language models without 3D inputs or extensive supervised data. Experiments show that SpatialThinker-7B substantially outperforms both supervised fine-tuning and sparse-reward RL baselines, nearly doubling the base model's gain on spatial understanding and real-world VQA benchmarks relative to sparse RL, and surpasses GPT-4o. This work establishes a low-resource, efficient paradigm for spatial reasoning.
📝 Abstract
Multimodal large language models (MLLMs) have achieved remarkable progress in vision-language tasks, but they continue to struggle with spatial understanding. Existing spatial MLLMs often rely on explicit 3D inputs or architecture-specific modifications, and remain constrained by large-scale datasets or sparse supervision. To address these limitations, we introduce SpatialThinker, a 3D-aware MLLM trained with RL to integrate structured spatial grounding with multi-step reasoning. The model simulates human-like spatial perception by constructing a scene graph of task-relevant objects and spatial relations, and reasoning towards an answer via dense spatial rewards. SpatialThinker rests on two key contributions: (1) a data synthesis pipeline that generates STVQA-7K, a high-quality spatial VQA dataset, and (2) online RL with a multi-objective dense spatial reward enforcing spatial grounding. SpatialThinker-7B outperforms supervised fine-tuning and the sparse RL baseline on spatial understanding and real-world VQA benchmarks, nearly doubling the base-model gain compared to sparse RL, and surpassing GPT-4o. These results showcase the effectiveness of combining spatial supervision with reward-aligned reasoning in enabling robust 3D spatial understanding with limited data, advancing MLLMs towards human-level visual reasoning.
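To make the "multi-objective dense spatial reward" concrete, the sketch below shows one plausible way such a reward could combine a sparse answer-correctness term with dense auxiliary signals (format compliance and scene-graph grounding overlap). The component names, weights, and function signature are illustrative assumptions, not the paper's exact formulation.

```python
def dense_spatial_reward(answer_correct: bool,
                         format_ok: bool,
                         scene_graph_overlap: float,
                         weights: tuple = (1.0, 0.2, 0.5)) -> float:
    """Hypothetical multi-objective dense reward for spatial RL.

    answer_correct      -- final answer matches the ground truth (sparse term)
    format_ok           -- output follows the required reasoning/scene-graph format
    scene_graph_overlap -- agreement between predicted and reference objects and
                           relations, in [0, 1] (dense term: gives partial credit
                           even when the final answer is wrong)
    """
    w_ans, w_fmt, w_spatial = weights
    reward = w_ans * float(answer_correct)          # sparse outcome reward
    reward += w_fmt * float(format_ok)              # dense: format compliance
    reward += w_spatial * scene_graph_overlap       # dense: spatial grounding
    return reward
```

Unlike a sparse-RL baseline that returns only the answer-correctness term, the dense terms reward intermediate grounding quality, which is the mechanism the abstract credits for the improved sample efficiency.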