🤖 AI Summary
This work addresses cross-embodiment transfer of dexterous manipulation skills in imitation learning, where reliance on 2D observations and temporally ordered action representations hinders generalization across heterogeneous robotic hands. The authors propose the Structural Action Transformer (SAT), which takes 3D point clouds as input and models each action chunk as an unordered, variable-length set of joint-wise trajectories, so that the joint count becomes a variable sequence length. SAT introduces an Embodied Joint Codebook that encodes each joint's functional role and kinematic properties, and it is trained with a continuous-time flow matching objective. After pre-training on large-scale heterogeneous datasets and fine-tuning on simulated and real-world tasks, SAT is reported to outperform all baselines, with better sample efficiency and effective cross-embodiment skill transfer.
📝 Abstract
Achieving human-level dexterity in robots via imitation learning from heterogeneous datasets is hindered by the challenge of cross-embodiment skill transfer, particularly for high-DoF robotic hands. Existing methods, often relying on 2D observations and temporal-centric action representation, struggle to capture 3D spatial relations and fail to handle embodiment heterogeneity. This paper proposes the Structural Action Transformer (SAT), a new 3D dexterous manipulation policy that challenges this paradigm by introducing a structural-centric perspective. We reframe each action chunk not as a temporal sequence, but as a variable-length, unordered sequence of joint-wise trajectories. This structural formulation allows a Transformer to natively handle heterogeneous embodiments, treating the joint count as a variable sequence length. To encode structural priors and resolve ambiguity, we introduce an Embodied Joint Codebook that embeds each joint's functional role and kinematic properties. Our model learns to generate these trajectories from 3D point clouds via a continuous-time flow matching objective. We validate our approach by pre-training on large-scale heterogeneous datasets and fine-tuning on simulation and real-world dexterous manipulation tasks. Our method consistently outperforms all baselines, demonstrating superior sample efficiency and effective cross-embodiment skill transfer. This structural-centric representation offers a new path toward scaling policies for high-DoF, heterogeneous manipulators.
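To make the structural-centric idea concrete, here is a minimal NumPy sketch of the two ingredients the abstract names: turning an action chunk into one token per joint, and building a flow matching training pair. It is an illustration under stated assumptions, not the authors' implementation. The function names, the integer joint-id lookup standing in for the Embodied Joint Codebook, the linear projection, and the straight-line interpolation path (a common flow matching variant) are all hypothetical choices that the abstract does not specify.

```python
import numpy as np


def structural_tokens(chunk):
    """Turn an action chunk of shape (T, J) into J joint tokens of shape (J, T).

    Each row is the trajectory of one joint over the T-step horizon, so the
    joint count J plays the role of the (variable) sequence length.
    """
    return chunk.T


def embed_joint_tokens(tokens, joint_ids, codebook, proj):
    """Project joint tokens to width D and add a per-joint codebook vector.

    tokens:    (J, T) joint-wise trajectories
    joint_ids: (J,) integer indices into the codebook (hypothetical stand-in
               for each joint's functional role and kinematic properties)
    codebook:  (K, D) table of joint embeddings
    proj:      (T, D) linear projection shared by all joints
    """
    return tokens @ proj + codebook[joint_ids]


def flow_matching_pair(x1, rng):
    """Sample one training pair on a straight-line path (an assumed variant).

    x1 is the data sample. Returns the interpolated point x_t, the time t,
    and the regression target velocity x1 - x0.
    """
    x0 = rng.standard_normal(x1.shape)  # noise sample
    t = rng.uniform()                   # time drawn from [0, 1)
    xt = (1.0 - t) * x0 + t * x1
    return xt, t, x1 - x0


rng = np.random.default_rng(0)
T, D, K = 8, 32, 64                 # horizon, token width, codebook size
proj = rng.standard_normal((T, D))
codebook = rng.standard_normal((K, D))

# Two hypothetical hands with different joint counts share the same model width.
hand_a = rng.standard_normal((T, 16))   # 16-joint hand
hand_b = rng.standard_normal((T, 24))   # 24-joint hand
emb_a = embed_joint_tokens(structural_tokens(hand_a), np.arange(16), codebook, proj)
emb_b = embed_joint_tokens(structural_tokens(hand_b), np.arange(24), codebook, proj)

xt, t, target = flow_matching_pair(structural_tokens(hand_a), rng)
```

In this sketch the two hands differ only in the number of tokens, `emb_a.shape[0]` versus `emb_b.shape[0]`, while the token width `D` stays fixed. Because each token's identity comes from its codebook entry rather than its position, shuffling the joints only shuffles the rows, which is what lets the tokens be treated as an unordered set.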