🤖 AI Summary
Dexterous robotic manipulation faces a fundamental challenge in spatial understanding: existing 3D point cloud models lack semantic abstraction, while 2D visual encoders struggle with precise geometric reasoning. To address this, we propose SEM, the first diffusion-based policy framework that jointly integrates 3D spatial enhancement and robot-centric graph encoding. Our key contributions are: (1) a spatial enhancer that explicitly injects geometric context from raw 3D point clouds into the diffusion process; and (2) a joint-aware graph neural network that encodes the robot's kinematic structure and inter-joint dependencies, enabling semantic-geometric co-reasoning in a unified vision-action representation. Evaluated across diverse dexterous manipulation tasks, SEM achieves significant performance gains over state-of-the-art methods. It also demonstrates superior generalization and robustness under challenging conditions such as partial occlusion, viewpoint variation, and unseen objects, validating its capacity for real-world deployment.
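The two modules named above lend themselves to a compact illustration. Below is a minimal PyTorch sketch, assuming a cross-attention fusion for the spatial enhancer and a chain-structured joint graph for the state encoder; every class name, dimension, and design choice here is an illustrative assumption, not the paper's actual architecture.

```python
# Minimal PyTorch sketch of the two SEM modules described above. The
# cross-attention fusion, chain adjacency, and all dimensions are
# illustrative assumptions, not the authors' exact design.
import torch
import torch.nn as nn


class SpatialEnhancer(nn.Module):
    """Injects 3D geometric context into 2D visual tokens (assumed design:
    visual tokens cross-attend to features lifted from the raw point cloud)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens: torch.Tensor, points: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, T, dim) from a 2D encoder; points: (B, N, 3)
        point_feats = self.point_mlp(points)                       # (B, N, dim)
        enhanced, _ = self.cross_attn(visual_tokens, point_feats, point_feats)
        return self.norm(visual_tokens + enhanced)                 # residual fusion


class JointGraphEncoder(nn.Module):
    """Encodes joint state by message passing over the kinematic structure
    (assumed here to be a serial chain; a real robot would use its joint tree)."""

    def __init__(self, num_joints: int, dim: int = 256, layers: int = 2):
        super().__init__()
        self.embed = nn.Linear(1, dim)  # per-joint angle -> feature
        self.msg = nn.ModuleList([nn.Linear(dim, dim) for _ in range(layers)])
        # Chain adjacency with self-loops, row-normalized for averaging.
        adj = torch.eye(num_joints)
        idx = torch.arange(num_joints - 1)
        adj[idx, idx + 1] = 1.0
        adj[idx + 1, idx] = 1.0
        self.register_buffer("adj", adj / adj.sum(-1, keepdim=True))

    def forward(self, joint_angles: torch.Tensor) -> torch.Tensor:
        # joint_angles: (B, J) -> per-joint features (B, J, dim)
        h = self.embed(joint_angles.unsqueeze(-1))
        for lin in self.msg:
            h = torch.relu(lin(self.adj @ h)) + h                  # propagate + residual
        return h
```

The residual connections in both modules let the base visual and joint features pass through unchanged when the geometric or graph context adds little, a common stabilizing choice in fusion layers.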
📝 Abstract
A key challenge in robot manipulation lies in developing policy models with strong spatial understanding: the ability to reason about 3D geometry, object relations, and robot embodiment. Existing methods often fall short: 3D point cloud models lack semantic abstraction, while 2D image encoders struggle with spatial reasoning. To address this, we propose SEM (Spatial Enhanced Manipulation model), a novel diffusion-based policy framework that explicitly enhances spatial understanding from two complementary perspectives. A spatial enhancer augments visual representations with 3D geometric context, while a robot state encoder captures embodiment-aware structure through graph-based modeling of joint dependencies. By integrating these modules, SEM significantly improves spatial understanding, leading to robust and generalizable manipulation that outperforms existing baselines across diverse tasks.
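To make the diffusion-based policy framing concrete, here is a hedged sketch of a training step that such modules could condition. The epsilon-prediction objective and linear beta schedule are standard DDPM defaults; the `ActionDenoiser` layout and the pooled `cond` vector are assumptions for illustration, not details taken from the paper.

```python
# Hedged sketch of a diffusion-policy training step conditioned on a fused
# feature vector. Standard DDPM epsilon-prediction; the network layout and
# `cond` are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn


class ActionDenoiser(nn.Module):
    """Predicts the noise added to an action sequence, conditioned on a
    fused vision/robot-state feature and the diffusion timestep."""

    def __init__(self, action_dim: int = 7, horizon: int = 16,
                 cond_dim: int = 256, dim: int = 256):
        super().__init__()
        self.time_embed = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.net = nn.Sequential(
            nn.Linear(horizon * action_dim + cond_dim + dim, 512),
            nn.SiLU(),
            nn.Linear(512, horizon * action_dim),
        )
        self.horizon, self.action_dim = horizon, action_dim

    def forward(self, noisy_actions, t, cond):
        # noisy_actions: (B, horizon, action_dim); t: (B,); cond: (B, cond_dim)
        t_feat = self.time_embed(t.float().unsqueeze(-1) / 1000.0)
        x = torch.cat([noisy_actions.flatten(1), cond, t_feat], dim=-1)
        return self.net(x).view(-1, self.horizon, self.action_dim)


def diffusion_training_step(denoiser, actions, cond, num_steps=1000):
    """One DDPM training step: corrupt the expert action chunk at a random
    timestep, then regress the injected noise."""
    B = actions.shape[0]
    t = torch.randint(0, num_steps, (B,))
    betas = torch.linspace(1e-4, 0.02, num_steps)         # linear schedule
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t].view(B, 1, 1)
    noise = torch.randn_like(actions)
    noisy = alpha_bar.sqrt() * actions + (1.0 - alpha_bar).sqrt() * noise
    return nn.functional.mse_loss(denoiser(noisy, t, cond), noise)
```

Here `cond` is a placeholder for whatever pooled combination of the spatial-enhanced visual tokens and graph-encoded joint features SEM actually uses; the paper, not this sketch, defines that fusion.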