SEM: Enhancing Spatial Understanding for Robust Robot Manipulation

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Dexterous robotic manipulation faces a fundamental challenge in spatial understanding: existing 3D point cloud models lack semantic abstraction, while 2D visual encoders struggle with precise geometric reasoning. To address this, we propose SEM—the first diffusion-based policy framework that jointly integrates 3D spatial enhancement and robot-centric graph encoding. Our key contributions are: (1) a spatial enhancer that explicitly injects geometric context from raw 3D point clouds into the diffusion process; and (2) a joint-aware graph neural network that encodes the robot’s kinematic structure and inter-joint dependencies, enabling semantic-geometric co-reasoning for unified vision–action representation. Evaluated across diverse dexterous manipulation tasks, SEM achieves significant performance gains over state-of-the-art methods. It demonstrates superior generalization and robustness under challenging conditions—including partial occlusion, viewpoint variation, and unseen objects—validating its capacity for real-world deployment.
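The summary describes SEM as a diffusion-based policy: actions are generated by iteratively denoising a noisy action trajectory, conditioned on a fused observation embedding. The paper's actual denoiser and schedule are not given here; the sketch below is a minimal, generic DDPM-style reverse loop with a placeholder noise predictor (`noise_pred`, `obs_emb`, the linear beta schedule, and all dimensions are illustrative assumptions, not details from the paper).

```python
import numpy as np

def make_schedule(T=50, beta_start=1e-4, beta_end=0.02):
    # Linear noise schedule; alpha_bar is the cumulative signal fraction.
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def noise_pred(actions_t, t, obs_emb):
    # Placeholder for the learned denoiser; a real model would condition on
    # the fused visual / robot-state embedding (and the timestep t) here.
    return 0.1 * actions_t + 0.0 * obs_emb.mean()

def ddpm_sample(obs_emb, horizon=8, action_dim=7, T=50, seed=0):
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(T)
    x = rng.standard_normal((horizon, action_dim))  # start from pure noise
    for t in reversed(range(T)):
        eps = noise_pred(x, t, obs_emb)
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / np.sqrt(alphas[t])    # posterior mean step
        if t > 0:                                    # re-noise except at t=0
            x = x + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
    return x

actions = ddpm_sample(obs_emb=np.zeros(128))
print(actions.shape)  # (8, 7): an 8-step trajectory of 7-DoF actions
```

The key design point is that the observation embedding enters only through the denoiser, so the same sampling loop works regardless of how the visual and embodiment features are fused upstream.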

📝 Abstract
A key challenge in robot manipulation lies in developing policy models with strong spatial understanding, the ability to reason about 3D geometry, object relations, and robot embodiment. Existing methods often fall short: 3D point cloud models lack semantic abstraction, while 2D image encoders struggle with spatial reasoning. To address this, we propose SEM (Spatial Enhanced Manipulation model), a novel diffusion-based policy framework that explicitly enhances spatial understanding from two complementary perspectives. A spatial enhancer augments visual representations with 3D geometric context, while a robot state encoder captures embodiment-aware structure through graph-based modeling of joint dependencies. By integrating these modules, SEM significantly improves spatial understanding, leading to robust and generalizable manipulation across diverse tasks, outperforming existing baselines.
Problem

Research questions and friction points this paper is trying to address.

Enhancing spatial understanding in robot manipulation policies
Addressing limitations in 3D point cloud and 2D image models
Improving robustness and generalization in diverse manipulation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion-based policy framework enhances spatial understanding
Spatial enhancer adds 3D geometric context to visuals
Graph-based robot encoder models joint dependencies
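The third bullet, the graph-based robot state encoder, treats joints as graph nodes and kinematic links as edges so that each joint's representation mixes in its neighbors' states. A minimal sketch of that idea follows, assuming a serial 7-joint chain, mean-aggregation message passing, and arbitrary feature sizes; the paper's actual graph topology, layer design, and dimensions are not specified here.

```python
import numpy as np

def chain_adjacency(n_joints):
    # Undirected adjacency for a serial kinematic chain, with self-loops,
    # row-normalized so each row implements mean aggregation over neighbors.
    A = np.eye(n_joints)
    for i in range(n_joints - 1):
        A[i, i + 1] = A[i + 1, i] = 1.0
    return A / A.sum(axis=1, keepdims=True)

def gnn_layer(H, A, W):
    # One graph-convolution-style layer: aggregate neighbor features,
    # project with a learned weight matrix, apply ReLU.
    return np.maximum(A @ H @ W, 0.0)

n_joints, feat, hidden = 7, 4, 16
rng = np.random.default_rng(0)
H = rng.standard_normal((n_joints, feat))      # per-joint state (e.g. angle, velocity)
W = rng.standard_normal((feat, hidden)) * 0.1  # stand-in for learned weights
A = chain_adjacency(n_joints)

robot_emb = gnn_layer(H, A, W).mean(axis=0)    # pooled embodiment embedding
print(robot_emb.shape)  # (16,)
```

Because the adjacency encodes the kinematic structure, the pooled embedding is sensitive to inter-joint dependencies rather than treating the joint vector as an unstructured input.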
Xuewu Lin
Robotics Laboratory, Horizon Robotics, Beijing, China
Tianwei Lin
Zhejiang University
Lichao Huang
Senior Engineer, Horizon Robotics Inc
Hongyu Xie
Robotics Laboratory, Horizon Robotics, Beijing, China
Yiwei Jin
Robotics Laboratory, Horizon Robotics, Beijing, China
Keyu Li
Robotics Laboratory, Horizon Robotics, Beijing, China
Zhizhong Su
Horizon Robotics