🤖 AI Summary
Current keyframe imitation learning (IL) methods neglect the spatial symmetries inherent in robotic manipulation tasks, which makes them sample-inefficient and limits generalization. This work identifies and formalizes the *dual equivariance* (bi-equivariance) of keyframe action policies: equivariance under transformations of the workspace and under transformations of the object grasped by the gripper. We propose a coarse-to-fine SE(3) action evaluation mechanism that decouples translation and rotation modeling while preserving joint optimization. Building upon Transporter Networks, we introduce the 3D Keyframe Transporter, which integrates cross-correlation-based feature matching, dual-equivariant feature encoding, and hierarchical SE(3) pose search. Across multiple simulated manipulation tasks, the method outperforms strong keyframe IL baselines by more than 10% on average; in four real-robot experiments, the average improvement is 55%.
📝 Abstract
Recent advances in Keyframe Imitation Learning (IL) have enabled learning-based agents to solve a diverse range of manipulation tasks. However, most approaches ignore the rich symmetries in the problem setting and, as a consequence, are sample-inefficient. This work identifies and utilizes the bi-equivariant symmetry within Keyframe IL to design a policy that generalizes to transformations of both the workspace and the objects grasped by the gripper. We make two main contributions. First, we analyze the bi-equivariance properties of the keyframe action scheme and propose a Keyframe Transporter, derived from Transporter Networks, which evaluates actions using cross-correlation between the features of the grasped object and the features of the scene. Second, we propose a computationally efficient coarse-to-fine SE(3) action evaluation scheme for reasoning about the intertwined translation and rotation components of the action. The resulting method outperforms strong Keyframe IL baselines by an average of more than 10% on a wide range of simulation tasks, and by an average of 55% in four physical experiments.
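To illustrate why cross-correlation between grasped-object features and scene features is a natural fit for an equivariant policy, the following is a minimal sketch, not the paper's implementation. It assumes dense 3D feature volumes stored as NumPy arrays and covers translation only; rotation handling, learned encoders, and the coarse-to-fine SE(3) search are omitted. The names `cross_correlate` and `best_translation`, and the toy feature volumes, are hypothetical.

```python
import numpy as np


def cross_correlate(scene, kernel):
    """Valid-mode 3D cross-correlation of a scene volume with a kernel.

    scores[z, y, x] is the sum of the element-wise product between the
    kernel and the scene patch whose corner sits at (z, y, x).
    """
    kd, kh, kw = kernel.shape
    out_shape = tuple(s - k + 1 for s, k in zip(scene.shape, kernel.shape))
    scores = np.empty(out_shape)
    for z, y, x in np.ndindex(*out_shape):
        patch = scene[z:z + kd, y:y + kh, x:x + kw]
        scores[z, y, x] = np.sum(patch * kernel)
    return scores


def best_translation(scene, kernel):
    """Return the voxel offset with the highest correlation score."""
    scores = cross_correlate(scene, kernel)
    index = np.unravel_index(np.argmax(scores), scores.shape)
    return tuple(int(i) for i in index)


# Toy data (assumption): random "grasped object" features used as the kernel,
# and a scene that contains the matching pattern at offset (2, 3, 4).
rng = np.random.default_rng(0)
kernel = rng.normal(size=(3, 3, 3))
scene = np.zeros((10, 10, 10))
scene[2:5, 3:6, 4:7] = kernel

pick = best_translation(scene, kernel)

# Translate the whole scene; the selected action translates with it.
shift = (3, 2, 1)
shifted_scene = np.roll(scene, shift, axis=(0, 1, 2))
pick_shifted = best_translation(shifted_scene, kernel)

print(pick, pick_shifted)
```

In this toy setting, shifting the scene shifts the highest-scoring offset by the same amount, which is the translational part of the symmetry the abstract describes. Extending the idea to rotations would require scoring rotated copies of the kernel as well, which is where an efficient SE(3) evaluation scheme becomes important.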