🤖 AI Summary
Humanoid robots struggle to learn generalizable and agile interaction skills from limited data, and existing approaches often rely on task-specific rewards, limiting their scalability. To address this challenge, this work proposes HumanX, a framework that, for the first time, enables zero-shot transfer of complex interactive skills from a single human demonstration video to a physical humanoid robot. HumanX comprises the XGen data generation pipeline, which automatically produces physically plausible and diverse interaction data from video, and the XMimic unified imitation learning module, which integrates physics-based simulation with behavioral cloning, eliminating the need for task-specific rewards or extensive real-world interaction data. The framework successfully acquires ten distinct skills across five domains and demonstrates high-complexity tasks, such as continuous ball passing and intricate shooting, on the Unitree G1 robot, achieving a zero-shot generalization success rate over eight times higher than that of current methods.
📝 Abstract
Enabling humanoid robots to perform agile and adaptive interactive tasks has long been a core challenge in robotics. Current approaches are bottlenecked by either the scarcity of realistic interaction data or the need for meticulous, task-specific reward engineering, which limits their scalability. To narrow this gap, we present HumanX, a full-stack framework that compiles human video into generalizable, real-world interaction skills for humanoids, without task-specific rewards. HumanX integrates two co-designed components: XGen, a data generation pipeline that synthesizes diverse and physically plausible robot interaction data from video while supporting scalable data augmentation; and XMimic, a unified imitation learning framework that learns generalizable interaction skills. Evaluated across five distinct domains (basketball, football, badminton, cargo pickup, and reactive fighting), HumanX successfully acquires 10 different skills and transfers them zero-shot to a physical Unitree G1 humanoid. The learned capabilities include complex maneuvers such as pump-fake turnaround fadeaway jumpshots without any external perception, as well as interactive tasks like sustained human-robot passing sequences over 10 consecutive cycles, learned from a single video demonstration. Our experiments show that HumanX achieves over 8 times higher generalization success than prior methods, demonstrating a scalable and task-agnostic pathway for learning versatile, real-world robot interactive skills.