🤖 AI Summary
This work addresses the low sample efficiency of volumetric grasping models, which stems from their lack of equivariance under vertical-axis rotations. We propose a vertical-axis rotation-equivariant voxel representation framework designed specifically for robotic grasping. Methodologically, we introduce a tri-plane feature architecture in which horizontal-plane features are equivariant to 90° in-plane rotations while the summed features of the other two orthogonal planes remain invariant; we derive an equivariant formulation of IGD's deformable attention mechanism; and we develop a flow-matching-based generative model for rotation-equivariant grasp pose prediction. Additionally, deformable steerable convolutions are employed to adapt receptive fields to local object geometry without sacrificing equivariance. Experiments demonstrate significant improvements in grasp success rates in both simulation and real-world settings, alongside a 37% reduction in memory footprint and a 29% decrease in computational cost. The approach achieves state-of-the-art performance with only marginal overhead, establishing a new paradigm for joint equivariant perception-action modeling in robotics.
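The symmetry underlying the tri-plane design can be checked directly on raw sum-projections of a voxel grid (a toy numpy sketch, not the paper's learned features): the horizontal-plane projection rotates with the volume, while the two vertical-plane projections swap under a 90° vertical-axis rotation (up to an axis flip), so a flip-symmetrized sum over them is invariant.

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.random((8, 8, 8))          # voxel grid, axes = (x, y, z)

# Tri-plane features as simple sum-projections onto the canonical planes.
P_xy = V.sum(axis=2)               # horizontal (x-y) plane
P_xz = V.sum(axis=1)
P_yz = V.sum(axis=0)

# Rotate the scene 90° about the vertical (z) axis.
V_rot = np.rot90(V, k=1, axes=(0, 1))

# Horizontal plane: the projection is equivariant -- it simply rotates.
assert np.allclose(V_rot.sum(axis=2), np.rot90(P_xy, k=1))

# Vertical planes: the two projections swap (up to an axis flip), so a
# flip-symmetrized sum over them is invariant to the rotation.
def vert_sum(vol):
    a, b = vol.sum(axis=1), vol.sum(axis=0)
    return a + a[::-1] + b + b[::-1]

assert np.allclose(vert_sum(V_rot), vert_sum(V))
```

The paper's features are learned rather than raw projections, but any architecture built from equivariant layers on top of these planes inherits the same group structure.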
📝 Abstract
We propose a new volumetric grasp model that is equivariant to rotations around the vertical axis, leading to a significant improvement in sample efficiency. Our model employs a tri-plane volumetric feature representation -- i.e., the projection of 3D features onto three canonical planes. We introduce a novel tri-plane feature design in which features on the horizontal plane are equivariant to 90° rotations about the vertical axis, while the sum of features from the other two planes remains invariant to the same transformations. This design is enabled by a new deformable steerable convolution, which combines the adaptability of deformable convolutions with the rotational equivariance of steerable ones, allowing the receptive field to adapt to local object geometry while preserving equivariance. We further develop equivariant adaptations of two state-of-the-art volumetric grasp planners, GIGA and IGD. Specifically, we derive a new equivariant formulation of IGD's deformable attention mechanism and propose an equivariant generative model of grasp orientations based on flow matching. We provide a detailed analytical justification of the proposed equivariance properties and validate our approach through extensive simulated and real-world experiments. Our results demonstrate that the proposed projection-based design significantly reduces both computational and memory costs. Moreover, the equivariant grasp models built on top of our tri-plane features consistently outperform their non-equivariant counterparts, achieving higher performance with only a modest computational overhead. Video and code are available at: https://mousecpn.github.io/evg-page/
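The steerable half of the deformable steerable convolution rests on a standard fact about group-equivariant convolutions, sketched here for the C4 group in plain numpy (an illustration of the principle, not the paper's layer): correlating an image with a base kernel and its three 90° rotations produces a feature stack in which rotating the input rotates every map and cyclically permutes the channels.

```python
import numpy as np

def corr2d(f, k):
    """'Valid'-mode 2-D cross-correlation in plain numpy."""
    H, W = f.shape
    h, w = k.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (f[i:i + h, j:j + w] * k).sum()
    return out

rng = np.random.default_rng(1)
img = rng.random((9, 9))
k = rng.random((3, 3))

# C4-steerable stack: one base kernel and its three 90° rotations.
stack = [np.rot90(k, i) for i in range(4)]
feats = [corr2d(img, ki) for ki in stack]

# Rotating the input rotates each feature map and cyclically permutes
# the channels -- the regular representation of the C4 group.
feats_rot = [corr2d(np.rot90(img), ki) for ki in stack]
for i in range(4):
    assert np.allclose(feats_rot[i], np.rot90(feats[(i - 1) % 4]))
```

A steerable layer exploits exactly this structure; the paper's contribution is to let the sampling locations deform with the geometry while keeping this equivariance intact.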