🤖 AI Summary
Current multimodal large language models (MLLMs) face significant bottlenecks in 3D spatial understanding, relying heavily on explicit 3D inputs, custom architectures, or large-scale annotated datasets. To address this, we propose SpatialThinker, a framework that pairs a high-quality spatial visual question answering dataset, STVQA-7K, with scene-graph modeling and multi-step reasoning. Crucially, it introduces a multi-objective dense-reward online reinforcement learning paradigm tailored to spatial relational reasoning, unlocking the implicit 3D representational capacity of vision-language models without 3D inputs or extensive supervised data. Experiments show that SpatialThinker-7B substantially outperforms both supervised fine-tuning and sparse-reward RL baselines, nearly doubling the base model's gain on spatial understanding and real-world VQA benchmarks relative to sparse RL, and surpasses GPT-4o. This work establishes a low-resource, efficient paradigm for spatial reasoning.
📝 Abstract
Multimodal large language models (MLLMs) have achieved remarkable progress in vision-language tasks, but they continue to struggle with spatial understanding. Existing spatial MLLMs often rely on explicit 3D inputs or architecture-specific modifications, and remain constrained by large-scale datasets or sparse supervision. To address these limitations, we introduce SpatialThinker, a 3D-aware MLLM trained with RL to integrate structured spatial grounding with multi-step reasoning. The model simulates human-like spatial perception by constructing a scene graph of task-relevant objects and spatial relations, and reasoning towards an answer via dense spatial rewards. SpatialThinker rests on two key contributions: (1) a data synthesis pipeline that generates STVQA-7K, a high-quality spatial VQA dataset, and (2) online RL with a multi-objective dense spatial reward enforcing spatial grounding. SpatialThinker-7B outperforms supervised fine-tuning and the sparse RL baseline on spatial understanding and real-world VQA benchmarks, nearly doubling the base-model gain compared to sparse RL, and surpassing GPT-4o. These results showcase the effectiveness of combining spatial supervision with reward-aligned reasoning in enabling robust 3D spatial understanding with limited data, advancing MLLMs towards human-level visual reasoning.
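To make the "multi-objective dense spatial reward" concrete, the sketch below shows one plausible way such a reward could combine a sparse answer-correctness term with dense auxiliary signals (format compliance and scene-graph grounding overlap). The component names, weights, and function signature are illustrative assumptions, not the paper's exact formulation.

```python
def dense_spatial_reward(answer_correct: bool,
                         format_ok: bool,
                         scene_graph_overlap: float,
                         weights: tuple = (1.0, 0.2, 0.5)) -> float:
    """Hypothetical multi-objective dense reward for spatial RL.

    answer_correct      -- final answer matches the ground truth (sparse term)
    format_ok           -- output follows the required reasoning/scene-graph format
    scene_graph_overlap -- agreement between predicted and reference objects and
                           relations, in [0, 1] (dense term: gives partial credit
                           even when the final answer is wrong)
    """
    w_ans, w_fmt, w_spatial = weights
    reward = w_ans * float(answer_correct)          # sparse outcome reward
    reward += w_fmt * float(format_ok)              # dense: format compliance
    reward += w_spatial * scene_graph_overlap       # dense: spatial grounding
    return reward
```

Unlike a sparse-RL baseline that returns only the answer-correctness term, the dense terms reward intermediate grounding quality, which is the mechanism the abstract credits for the improved sample efficiency.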