🤖 AI Summary
Problem: Existing quadruped locomotion policies rely heavily on multi-stage knowledge distillation (e.g., teacher–student frameworks), which requires pre-trained teacher models and privileged information and results in low efficiency and poor transferability. Method: We propose the Unified Locomotion Transformer (ULT), the first single-stage, end-to-end Transformer architecture that jointly integrates privileged-information-guided policy learning, deep reinforcement learning, self-supervised next state-action prediction, and behavioral cloning. ULT eliminates sequential distillation and instead co-optimizes a high-performance teacher policy and a lightweight student policy under one unified training objective. Contribution/Results: ULT enables zero-shot sim-to-real deployment without fine-tuning. Experiments show significantly reduced cross-domain transfer difficulty on complex terrains and state-of-the-art generalization performance both in simulation and on real quadruped robots.
📝 Abstract
Quadrupeds have advanced rapidly in their ability to traverse complex terrains. The adoption of deep reinforcement learning (RL), transformers, and various knowledge transfer techniques can greatly reduce the sim-to-real gap. However, the classical teacher-student framework commonly used in existing locomotion policies requires a pre-trained teacher and uses privileged information to guide the student policy. As large-scale models, especially transformer-based ones, are adopted in robot controllers, this knowledge distillation technique begins to show weaknesses in efficiency because it requires multiple supervised stages. In this paper, we propose the Unified Locomotion Transformer (ULT), a new transformer-based framework that unifies knowledge transfer and policy optimization in a single network while still taking advantage of privileged information. The policies are optimized with reinforcement learning, next state-action prediction, and action imitation, all in a single training stage, to achieve zero-shot deployment. Evaluation results show that ULT obtains optimal teacher and student policies simultaneously, greatly easing knowledge transfer even with complex transformer-based models.
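The abstract names three objectives (reinforcement learning, next state-action prediction, and action imitation) that are optimized together in one training stage. As a rough illustration only, the sketch below sums three such terms into a single scalar loss. It is not ULT's published objective: the PPO clipped surrogate standing in for the RL term, the mean-squared-error choices, the `LossWeights` coefficients, and every function name here are hypothetical assumptions.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class LossWeights:
    # Hypothetical coefficients: the abstract does not state how ULT weights its terms.
    rl: float = 1.0
    prediction: float = 0.5
    imitation: float = 0.5


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length, non-empty vectors."""
    if len(pred) != len(target) or not pred:
        raise ValueError("vectors must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def ppo_clip_loss(ratios: Sequence[float], advantages: Sequence[float],
                  clip_eps: float = 0.2) -> float:
    """Negated PPO clipped surrogate, averaged over samples.

    Assumption: used only as a generic stand-in for "the RL term"; the
    abstract does not say which RL algorithm ULT uses.
    """
    if len(ratios) != len(advantages) or not ratios:
        raise ValueError("ratios and advantages must be non-empty and aligned")
    total = 0.0
    for r, a in zip(ratios, advantages):
        clipped = min(max(r, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(r * a, clipped * a)
    return -total / len(ratios)


def unified_loss(ratios: Sequence[float], advantages: Sequence[float],
                 pred_next: Sequence[float], true_next: Sequence[float],
                 student_action: Sequence[float], teacher_action: Sequence[float],
                 weights: Optional[LossWeights] = None) -> float:
    """Sum three objectives into one scalar so one optimizer step covers all of them."""
    w = weights if weights is not None else LossWeights()
    l_rl = ppo_clip_loss(ratios, advantages)
    # Self-supervised term: predicted vs. observed next state-action vector
    # (state and action concatenated; a hypothetical encoding).
    l_pred = mse(pred_next, true_next)
    # Imitation term: student action regressed onto the teacher action.
    # In an autodiff framework the teacher action would normally be detached
    # (stop-gradient) so imitation does not pull the teacher toward the student.
    l_imit = mse(student_action, teacher_action)
    return w.rl * l_rl + w.prediction * l_pred + w.imitation * l_imit


if __name__ == "__main__":
    loss = unified_loss(ratios=[1.0, 1.5], advantages=[1.0, 2.0],
                        pred_next=[0.0, 1.0], true_next=[0.0, 0.0],
                        student_action=[0.2], teacher_action=[0.0])
    print(round(loss, 4))  # -1.43
```

Because every term feeds one scalar, a single backward pass would update all the shared parameters that produce them. Under the assumptions above, this is one plausible reading of how a single stage could replace separate teacher and student phases.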