🤖 AI Summary
Problem: Existing equivariant diffusion policies require multi-view point-cloud inputs, making them incompatible with mainstream monocular RGB cameras (e.g., a GoPro). This limits their applicability to eye-in-hand robotic manipulation under monocular vision.
Method: We propose the first SO(3)-equivariant diffusion policy framework for monocular RGB input, bypassing explicit 3D reconstruction. Instead, 2D image features are projected onto the unit sphere S² and processed by an SO(3)-equivariant CNN, so the model inherently captures rotational symmetry.
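As a minimal sketch of the projection step, the snippet below back-projects pixels through a pinhole camera onto the unit sphere S² and scatters per-pixel features onto an equiangular θ–φ grid. The intrinsics (fx, fy, cx, cy), the nearest-neighbor scatter, and the grid resolution are illustrative assumptions, not the paper's specified implementation:

```python
import numpy as np

def pixels_to_sphere(H, W, fx, fy, cx, cy):
    # Back-project each pixel through an assumed pinhole model to a viewing
    # ray, then normalize so every ray lands on the unit sphere S^2.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    rays = np.stack([(u - cx) / fx, (v - cy) / fy,
                     np.ones_like(u, dtype=float)], axis=-1)
    return rays / np.linalg.norm(rays, axis=-1, keepdims=True)  # (H, W, 3)

def project_features(feat, dirs, n_theta=32, n_phi=64):
    # Scatter per-pixel features (H, W, C) onto an equiangular spherical
    # grid via nearest-neighbor binning, averaging pixels that share a cell.
    theta = np.arccos(np.clip(dirs[..., 2], -1.0, 1.0))  # polar angle
    phi = np.arctan2(dirs[..., 1], dirs[..., 0])         # azimuth
    ti = np.clip((theta / np.pi * n_theta).astype(int), 0, n_theta - 1)
    pj = np.clip(((phi + np.pi) / (2 * np.pi) * n_phi).astype(int), 0, n_phi - 1)
    sphere = np.zeros((n_theta, n_phi, feat.shape[-1]))
    count = np.zeros((n_theta, n_phi, 1))
    np.add.at(sphere, (ti, pj), feat)
    np.add.at(count, (ti, pj), 1.0)
    return sphere / np.maximum(count, 1.0)
```

The resulting spherical feature map can then be fed to a spherical/SO(3)-equivariant CNN; empty grid cells are simply left at zero in this sketch.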
Contribution/Results: Our approach is the first to achieve purely vision-driven SO(3)-equivariant policy learning without multi-view geometric priors. Evaluated both in simulation and on real robots, it significantly improves sample efficiency and task success rates, consistently outperforming strong baselines in both domains.
📝 Abstract
Equivariant models have recently been shown to improve the data efficiency of diffusion policy by a significant margin. However, prior work in this direction has focused primarily on point-cloud inputs generated by multiple cameras fixed in the workspace. This input type is incompatible with the now-common setting in which the primary input modality is an eye-in-hand RGB camera such as a GoPro. This paper closes the gap by incorporating into the diffusion policy model a process that projects features from the 2D RGB camera image onto a sphere, which enables reasoning about symmetries in SO(3) without explicitly reconstructing a point cloud. We perform extensive experiments in both simulation and the real world demonstrating that our method consistently outperforms strong baselines in both performance and sample efficiency. Our work is the first SO(3)-equivariant policy-learning framework for robotic manipulation that uses only monocular RGB inputs.
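To make the symmetry claim concrete, here is a toy illustration (not the paper's architecture) of why convolutions on a spherical grid respect rotations: on an assumed equiangular grid, an azimuthal rotation is a circular shift along the φ axis, and a circular convolution along that axis commutes with it. This is the one-dimensional analogue of the SO(3) equivariance the model exploits:

```python
import numpy as np

def rotate_z(sphere, shift):
    # Azimuthal rotation of a (n_theta, n_phi, C) spherical feature map
    # on an equiangular grid is a circular shift along the phi axis.
    return np.roll(sphere, shift, axis=1)

def azimuthal_conv(sphere, kernel):
    # Circular convolution along the azimuth: out[., j] = sum_k w_k * x[., j+k].
    out = np.zeros_like(sphere)
    for k, w in enumerate(kernel):
        out += w * np.roll(sphere, -k, axis=1)
    return out

# Equivariance check: convolving a rotated signal equals rotating the
# convolved signal, i.e. azimuthal_conv(rotate_z(x, s)) == rotate_z(azimuthal_conv(x), s).
```

Full SO(3) equivariance additionally handles rotations that move the poles, which requires spherical convolutions rather than this axis-wise toy version.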