🤖 AI Summary
Problem: Existing equivariant diffusion policies require multi-view point-cloud inputs, making them incompatible with mainstream monocular RGB cameras (e.g., a GoPro). This limits their applicability to eye-in-hand robotic manipulation under monocular vision.
Method: We propose the first SO(3)-equivariant diffusion policy framework for monocular RGB input, bypassing explicit 3D reconstruction. Instead, 2D image features are projected onto the unit sphere S² and processed by an SO(3)-equivariant CNN, so the model inherently captures rotational symmetry.
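As a minimal sketch of the projection step, the snippet below back-projects pixels through a pinhole camera onto the unit sphere S² and scatters per-pixel features onto an equiangular θ–φ grid. The intrinsics (fx, fy, cx, cy), the nearest-neighbor scatter, and the grid resolution are illustrative assumptions, not the paper's specified implementation:

```python
import numpy as np

def pixels_to_sphere(H, W, fx, fy, cx, cy):
    # Back-project each pixel through an assumed pinhole model to a viewing
    # ray, then normalize so every ray lands on the unit sphere S^2.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    rays = np.stack([(u - cx) / fx, (v - cy) / fy,
                     np.ones_like(u, dtype=float)], axis=-1)
    return rays / np.linalg.norm(rays, axis=-1, keepdims=True)  # (H, W, 3)

def project_features(feat, dirs, n_theta=32, n_phi=64):
    # Scatter per-pixel features (H, W, C) onto an equiangular spherical
    # grid via nearest-neighbor binning, averaging pixels that share a cell.
    theta = np.arccos(np.clip(dirs[..., 2], -1.0, 1.0))  # polar angle
    phi = np.arctan2(dirs[..., 1], dirs[..., 0])         # azimuth
    ti = np.clip((theta / np.pi * n_theta).astype(int), 0, n_theta - 1)
    pj = np.clip(((phi + np.pi) / (2 * np.pi) * n_phi).astype(int), 0, n_phi - 1)
    sphere = np.zeros((n_theta, n_phi, feat.shape[-1]))
    count = np.zeros((n_theta, n_phi, 1))
    np.add.at(sphere, (ti, pj), feat)
    np.add.at(count, (ti, pj), 1.0)
    return sphere / np.maximum(count, 1.0)
```

The resulting spherical feature map can then be fed to a spherical/SO(3)-equivariant CNN; empty grid cells are simply left at zero in this sketch.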
Contribution/Results: Our approach is the first to achieve purely vision-driven SO(3)-equivariant policy learning without multi-view geometric priors. Evaluated both in simulation and on real robots, it significantly improves sample efficiency and task success rates, consistently outperforming strong baselines in both domains.
📝 Abstract
Equivariant models have recently been shown to improve the data efficiency of diffusion policy by a significant margin. However, prior work in this direction has focused primarily on point-cloud inputs generated by multiple cameras fixed in the workspace. This input type is incompatible with the now-common setting in which the primary input modality is an eye-in-hand RGB camera such as a GoPro. This paper closes the gap by incorporating into the diffusion policy model a process that projects features from the 2D RGB camera image onto a sphere, which enables reasoning about symmetries in SO(3) without explicitly reconstructing a point cloud. We perform extensive experiments in both simulation and the real world demonstrating that our method consistently outperforms strong baselines in both performance and sample efficiency. Our work is the first SO(3)-equivariant policy-learning framework for robotic manipulation that uses only monocular RGB inputs.
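To make the symmetry claim concrete, here is a toy illustration (not the paper's architecture) of why convolutions on a spherical grid respect rotations: on an assumed equiangular grid, an azimuthal rotation is a circular shift along the φ axis, and a circular convolution along that axis commutes with it. This is the one-dimensional analogue of the SO(3) equivariance the model exploits:

```python
import numpy as np

def rotate_z(sphere, shift):
    # Azimuthal rotation of a (n_theta, n_phi, C) spherical feature map
    # on an equiangular grid is a circular shift along the phi axis.
    return np.roll(sphere, shift, axis=1)

def azimuthal_conv(sphere, kernel):
    # Circular convolution along the azimuth: out[., j] = sum_k w_k * x[., j+k].
    out = np.zeros_like(sphere)
    for k, w in enumerate(kernel):
        out += w * np.roll(sphere, -k, axis=1)
    return out

# Equivariance check: convolving a rotated signal equals rotating the
# convolved signal, i.e. azimuthal_conv(rotate_z(x, s)) == rotate_z(azimuthal_conv(x), s).
```

Full SO(3) equivariance additionally handles rotations that move the poles, which requires spherical convolutions rather than this axis-wise toy version.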