🤖 AI Summary
Humanoid robots require whole-body motion control that is simultaneously diverse, robust, and generalizable in complex, human-centric environments. Existing teacher-student frameworks lose motion diversity during policy distillation and generalize poorly to unseen motions. To address these limitations, the authors propose UniTracker, a CVAE-enhanced student policy framework: a conditional variational autoencoder (CVAE) is embedded within the policy network to explicitly model the latent diversity of human motions, and DAgger-based imitation learning is employed to improve adaptability and stability under partial observability. The resulting policy tracks diverse motion sequences with high fidelity using a single unified model. In both simulation and real-robot experiments, the approach significantly outperforms MLP-based DAgger baselines in motion quality, generalization to unseen motions, and deployment robustness.
📝 Abstract
Humanoid robots must achieve diverse, robust, and generalizable whole-body control to operate effectively in complex, human-centric environments. However, existing methods, particularly those based on teacher-student frameworks, often suffer from a loss of motion diversity during policy distillation and exhibit limited generalization to unseen behaviors. In this work, we present UniTracker, a simplified yet powerful framework that integrates a Conditional Variational Autoencoder (CVAE) into the student policy to explicitly model the latent diversity of human motion. By leveraging a learned CVAE prior, our method enables the student to retain expressive motion characteristics while improving robustness and adaptability under partial observations. The result is a single policy capable of tracking a wide spectrum of whole-body motions with high fidelity and stability. Comprehensive experiments in both simulation and real-world deployments demonstrate that UniTracker significantly outperforms MLP-based DAgger baselines in motion quality, generalization to unseen references, and deployment robustness, offering a practical and scalable solution for expressive humanoid control.
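To make the role of a learned CVAE prior more concrete, the following is a minimal sketch of two generic building blocks that CVAE-style policies commonly rely on: the KL divergence between a diagonal-Gaussian posterior and a diagonal-Gaussian prior, and a reparameterized latent sample. This is not UniTracker's implementation. The function names, the diagonal-Gaussian parameterization, and the example numbers are all assumptions made for illustration; the abstract does not specify the loss or the network structure.

```python
import math
import random


def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) for two diagonal Gaussians given means and log-variances.

    Hypothetical helper: in a CVAE-style policy, q would typically be an
    encoder posterior and p a learned prior conditioned on observations.
    """
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        # Closed-form KL per dimension, summed over dimensions.
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total


def sample_latent(mu, logvar, rng):
    """Reparameterized sample z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Assumed 2-D latent; the values are illustrative only.
    prior_mu, prior_logvar = [0.0, 0.0], [0.0, 0.0]
    post_mu, post_logvar = [1.0, 0.0], [0.0, 0.0]
    print("KL(q || p):", kl_diag_gaussians(post_mu, post_logvar, prior_mu, prior_logvar))
    print("z ~ prior:", sample_latent(prior_mu, prior_logvar, rng))
```

In a setup like this, one plausible design is to train with latents drawn from the posterior while penalizing the KL term, then draw latents from the learned prior at deployment, when only partial observations are available. Whether UniTracker follows exactly this recipe is not stated in the abstract.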