Spatial Policy: Guiding Visuomotor Robotic Manipulation with Spatial-Aware Modeling and Reasoning

📅 2025-08-21
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Current vision-action policies lack explicit spatial modeling capabilities, hindering reliable translation of visual plans into executable control in complex embodied environments. To address this, we propose Spatial Policy (SP), the first spatially aware modeling framework that jointly learns visual prediction and action execution through three key innovations: (1) spatially conditioned video generation for visual forecasting, (2) a spatial-layout-aware action prediction network, and (3) a two-stage feedback-driven replanning mechanism enabling co-optimization of vision-action learning and spatial logical reasoning. SP introduces a novel *spatial plan table*—a unified representation encoding both visual intent and action constraints—thereby significantly improving spatial relational understanding and online error correction. Evaluated on 11 challenging embodied tasks, SP achieves a mean success rate of 86.7%, outperforming the strongest baseline by 33.0 percentage points, and substantially enhancing practical utility and robustness in real-world scenarios.
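The summary's central object is the spatial plan table. As a rough mental model, it can be pictured as a per-step record pairing a predicted visual subgoal with the spatial quantities that action prediction must respect. The sketch below is a minimal illustration under that assumption; the class and field names (SpatialPlanEntry, gripper_pose, invalidate_from) are ours, not the paper's interface.

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class SpatialPlanEntry:
    """One row of a hypothetical spatial plan table (names are assumptions)."""
    step: int                 # index within the plan horizon
    object_pose: np.ndarray   # 3D position of the manipulated object
    gripper_pose: np.ndarray  # target end-effector position for this step
    visual_goal: np.ndarray   # predicted subgoal image, shape (H, W, 3)
    valid: bool = True        # cleared by the feedback policy on failure


@dataclass
class SpatialPlanTable:
    entries: list = field(default_factory=list)

    def invalidate_from(self, step: int) -> None:
        # Feedback-driven correction: once execution diverges at `step`,
        # every later entry must be regenerated rather than blindly executed.
        for e in self.entries:
            if e.step >= step:
                e.valid = False

    def pending(self) -> list:
        """Entries still considered executable."""
        return [e for e in self.entries if e.valid]
```

The invalidate_from method mirrors the summary's online error correction: entries downstream of a detected failure are flagged for replanning instead of being executed as-is.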

📝 Abstract
Vision-centric hierarchical embodied models have demonstrated strong potential for long-horizon robotic control. However, existing methods lack spatial awareness, limiting their effectiveness in bridging visual plans to actionable control in complex environments. To address this problem, we propose Spatial Policy (SP), a unified, spatially aware visuomotor manipulation framework built on explicit spatial modeling and reasoning. Specifically, we first design a spatially conditioned embodied video generation module that grounds visual predictions in a spatial plan table. We then propose a spatial-based action prediction module that infers executable actions in coordination with the predicted visual plan. Finally, we propose a spatial reasoning feedback policy that refines the spatial plan table via dual-stage replanning. Extensive experiments show that SP significantly outperforms state-of-the-art baselines, achieving a 33.0% average improvement over the best baseline. With an 86.7% average success rate across 11 diverse tasks, SP substantially enhances the practicality of embodied models for robotic control applications. Code and checkpoints are maintained at https://plantpotatoonmoon.github.io/SpatialPolicy/.
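Read as a control loop, the abstract's three modules compose roughly as follows. This is a schematic sketch under our own assumptions: the injected callables and the simplified env interface (whose step returns an (obs, done) pair) are illustrative placeholders, not the authors' released API.

```python
from typing import Any, Callable


def run_spatial_policy(
    env: Any,
    build_plan: Callable,       # initial spatial plan table from the first observation
    gen_video: Callable,        # module 1: spatially conditioned video generation
    predict_actions: Callable,  # module 2: spatial-based action prediction
    feedback: Callable,         # module 3: spatial reasoning feedback policy
    max_replans: int = 2,
):
    """Schematic glue for the three SP modules; not the authors' released code."""
    obs = env.reset()
    table = build_plan(obs)
    for _ in range(max_replans + 1):
        video_plan = gen_video(obs, table)            # forecast future frames
        for action in predict_actions(video_plan, table):
            obs, done = env.step(action)
            if done:                                  # task solved, stop early
                return obs
        # Dual-stage replanning: refine the plan table from execution feedback.
        table = feedback(obs, video_plan, table)
    return obs
```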
Problem

Research questions and friction points this paper is trying to address.

Lack of spatial awareness in robotic vision models
Bridging visual plans to actionable robotic control
Enhancing spatial modeling for complex manipulation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatial-conditioned embodied video generation module
Spatial-based action prediction with coordination
Spatial reasoning feedback policy with dual-stage replanning (see the sketch below)
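One plausible reading of the dual-stage replanning bullet, sketched below: a cheap first stage verifies that execution still matches the planned spatial layout, and the expensive generation stage is re-invoked only on divergence. The function name, threshold, and pose-based check are our assumptions, not details taken from the paper.

```python
import numpy as np


def dual_stage_replan(current_pose, planned_poses, regenerate, tol=0.03):
    """Illustrative two-stage feedback; threshold and names are assumptions.

    current_pose:  executed end-effector position, shape (3,)
    planned_poses: remaining planned positions from the spatial plan table
    regenerate:    callable that reruns the expensive video/plan generator
    """
    # Stage 1: cheap spatial consistency check of execution against the plan.
    if np.linalg.norm(planned_poses[0] - current_pose) <= tol:
        return planned_poses            # plan still valid, keep executing

    # Stage 2: only on divergence, pay for full replanning from the current
    # state, rebuilding the remainder of the plan table.
    return regenerate(current_pose)
```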
👥 Authors
Yijun Liu, Tsinghua University
Yuwei Liu, PhD student, Institute of Software, Chinese Academy of Sciences (Computer Science; Software and System Security)
Yuan Meng, Tsinghua University
Jieheng Zhang, Guangzhou University
Yuwei Zhou, Tsinghua University
Ye Li, Tsinghua University
Jiacheng Jiang, Tsinghua University
Kangye Ji, Tsinghua University
Shijia Ge, Tsinghua University (Machine Learning; AI; 3DV; Robotics; AI4Med)
Zhi Wang, Tsinghua University
Wenwu Zhu, Professor, Computer Science, Tsinghua University (Multimedia Computing; Network Representation Learning)