🤖 AI Summary
Dexterous robotic manipulation faces a fundamental challenge in spatial understanding: existing 3D point cloud models lack semantic abstraction, while 2D visual encoders struggle with precise geometric reasoning. To address this, we propose SEM, the first diffusion-based policy framework that jointly integrates 3D spatial enhancement and robot-centric graph encoding. Our key contributions are: (1) a spatial enhancer that explicitly injects geometric context from raw 3D point clouds into the diffusion process; and (2) a joint-aware graph neural network that encodes the robot's kinematic structure and inter-joint dependencies, enabling semantic-geometric co-reasoning in a unified vision-action representation. Evaluated across diverse dexterous manipulation tasks, SEM achieves significant performance gains over state-of-the-art methods. It also demonstrates superior generalization and robustness under challenging conditions such as partial occlusion, viewpoint variation, and unseen objects, validating its capacity for real-world deployment.
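The two modules named above lend themselves to a compact illustration. Below is a minimal PyTorch sketch, assuming a cross-attention fusion for the spatial enhancer and a chain-structured joint graph for the state encoder; every class name, dimension, and design choice here is an illustrative assumption, not the paper's actual architecture.

```python
# Minimal PyTorch sketch of the two SEM modules described above. The
# cross-attention fusion, chain adjacency, and all dimensions are
# illustrative assumptions, not the authors' exact design.
import torch
import torch.nn as nn


class SpatialEnhancer(nn.Module):
    """Injects 3D geometric context into 2D visual tokens (assumed design:
    visual tokens cross-attend to features lifted from the raw point cloud)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens: torch.Tensor, points: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, T, dim) from a 2D encoder; points: (B, N, 3)
        point_feats = self.point_mlp(points)                       # (B, N, dim)
        enhanced, _ = self.cross_attn(visual_tokens, point_feats, point_feats)
        return self.norm(visual_tokens + enhanced)                 # residual fusion


class JointGraphEncoder(nn.Module):
    """Encodes joint state by message passing over the kinematic structure
    (assumed here to be a serial chain; a real robot would use its joint tree)."""

    def __init__(self, num_joints: int, dim: int = 256, layers: int = 2):
        super().__init__()
        self.embed = nn.Linear(1, dim)  # per-joint angle -> feature
        self.msg = nn.ModuleList([nn.Linear(dim, dim) for _ in range(layers)])
        # Chain adjacency with self-loops, row-normalized for averaging.
        adj = torch.eye(num_joints)
        idx = torch.arange(num_joints - 1)
        adj[idx, idx + 1] = 1.0
        adj[idx + 1, idx] = 1.0
        self.register_buffer("adj", adj / adj.sum(-1, keepdim=True))

    def forward(self, joint_angles: torch.Tensor) -> torch.Tensor:
        # joint_angles: (B, J) -> per-joint features (B, J, dim)
        h = self.embed(joint_angles.unsqueeze(-1))
        for lin in self.msg:
            h = torch.relu(lin(self.adj @ h)) + h                  # propagate + residual
        return h
```

The residual connections in both modules let the base visual and joint features pass through unchanged when the geometric or graph context adds little, a common stabilizing choice in fusion layers.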
📝 Abstract
A key challenge in robot manipulation lies in developing policy models with strong spatial understanding: the ability to reason about 3D geometry, object relations, and robot embodiment. Existing methods often fall short: 3D point cloud models lack semantic abstraction, while 2D image encoders struggle with spatial reasoning. To address this, we propose SEM (Spatial Enhanced Manipulation model), a novel diffusion-based policy framework that explicitly enhances spatial understanding from two complementary perspectives. A spatial enhancer augments visual representations with 3D geometric context, while a robot state encoder captures embodiment-aware structure through graph-based modeling of joint dependencies. By integrating these modules, SEM significantly improves spatial understanding, leading to robust and generalizable manipulation that outperforms existing baselines across diverse tasks.
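To make the diffusion-based policy framing concrete, here is a hedged sketch of a training step that such modules could condition. The epsilon-prediction objective and linear beta schedule are standard DDPM defaults; the `ActionDenoiser` layout and the pooled `cond` vector are assumptions for illustration, not details taken from the paper.

```python
# Hedged sketch of a diffusion-policy training step conditioned on a fused
# feature vector. Standard DDPM epsilon-prediction; the network layout and
# `cond` are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn


class ActionDenoiser(nn.Module):
    """Predicts the noise added to an action sequence, conditioned on a
    fused vision/robot-state feature and the diffusion timestep."""

    def __init__(self, action_dim: int = 7, horizon: int = 16,
                 cond_dim: int = 256, dim: int = 256):
        super().__init__()
        self.time_embed = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.net = nn.Sequential(
            nn.Linear(horizon * action_dim + cond_dim + dim, 512),
            nn.SiLU(),
            nn.Linear(512, horizon * action_dim),
        )
        self.horizon, self.action_dim = horizon, action_dim

    def forward(self, noisy_actions, t, cond):
        # noisy_actions: (B, horizon, action_dim); t: (B,); cond: (B, cond_dim)
        t_feat = self.time_embed(t.float().unsqueeze(-1) / 1000.0)
        x = torch.cat([noisy_actions.flatten(1), cond, t_feat], dim=-1)
        return self.net(x).view(-1, self.horizon, self.action_dim)


def diffusion_training_step(denoiser, actions, cond, num_steps=1000):
    """One DDPM training step: corrupt the expert action chunk at a random
    timestep, then regress the injected noise."""
    B = actions.shape[0]
    t = torch.randint(0, num_steps, (B,))
    betas = torch.linspace(1e-4, 0.02, num_steps)         # linear schedule
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t].view(B, 1, 1)
    noise = torch.randn_like(actions)
    noisy = alpha_bar.sqrt() * actions + (1.0 - alpha_bar).sqrt() * noise
    return nn.functional.mse_loss(denoiser(noisy, t, cond), noise)
```

Here `cond` is a placeholder for whatever pooled combination of the spatial-enhanced visual tokens and graph-encoded joint features SEM actually uses; the paper, not this sketch, defines that fusion.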