🤖 AI Summary
This work addresses the challenge of deploying agile multi-contact locomotion on humanoid robots, which typically requires extensive skill-specific customization and parameter tuning. The authors propose ZEST, a framework that leverages reinforcement learning to train whole-body control policies end-to-end from diverse motion data—including motion capture, monocular video, and animation—without relying on contact labels, reference windows, state estimators, or intricate reward shaping. By integrating adaptive sampling with a model-based assistive-wrench curriculum, ZEST is trained entirely in simulation with moderate domain randomization and deploys zero-shot across behaviors and platforms. Experiments demonstrate successful reproduction of complex multi-contact skills such as crawling and breakdancing on the Atlas robot, direct transfer of dance and box-climbing motions from video to both Atlas and the Unitree G1, and even consecutive backflips, learned from animation, on the quadrupedal Spot robot, highlighting its strong cross-modal and cross-morphology generalization capabilities.
📝 Abstract
Achieving robust, human-like whole-body control on humanoid robots for agile, contact-rich behaviors remains a central challenge, demanding heavy per-skill engineering and brittle controller tuning. We introduce ZEST (Zero-shot Embodied Skill Transfer), a streamlined motion-imitation framework that trains policies via reinforcement learning from diverse sources -- high-fidelity motion capture, noisy monocular video, and non-physics-constrained animation -- and deploys them to hardware zero-shot. ZEST generalizes across behaviors and platforms while avoiding contact labels, reference or observation windows, state estimators, and extensive reward shaping. Its training pipeline combines adaptive sampling, which focuses training on difficult motion segments, with an automatic curriculum driven by a model-based assistive wrench; together, these enable dynamic, long-horizon maneuvers. We further provide a procedure for selecting joint-level gains from approximate analytical armature values for closed-chain actuators, along with a refined actuator model. Trained entirely in simulation with moderate domain randomization, ZEST demonstrates remarkable generality. On Boston Dynamics' Atlas humanoid, ZEST learns dynamic, multi-contact skills (e.g., army crawl, breakdancing) from motion capture. It transfers expressive dance and scene-interaction skills, such as box-climbing, directly from videos to Atlas and the Unitree G1. Furthermore, it extends across morphologies to the Spot quadruped, enabling acrobatics such as a continuous backflip learned from animation. Together, these results demonstrate robust zero-shot deployment across heterogeneous data sources and embodiments, establishing ZEST as a scalable interface between biological movements and their robotic counterparts.
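The abstract names two training-pipeline components: adaptive sampling that concentrates training on difficult motion segments, and an automatic curriculum that scales a model-based assistive wrench. Below is a minimal sketch of how such components might be wired together; the class and function names, the error-weighted softmax, and the linear gain schedule are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

class AdaptiveSegmentSampler:
    """Sample reference-motion segments in proportion to recent tracking error,
    so training concentrates on the hardest parts of each clip."""

    def __init__(self, num_segments, temperature=1.0, eps=0.05):
        self.errors = np.ones(num_segments)  # running per-segment tracking error
        self.temperature = temperature
        self.eps = eps  # uniform-mixing floor so no segment is starved

    def sample(self, rng):
        # Softmax over error estimates, mixed with a uniform floor.
        logits = self.errors / self.temperature
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        probs = (1 - self.eps) * probs + self.eps / len(probs)
        return rng.choice(len(probs), p=probs)

    def update(self, segment, tracking_error, decay=0.9):
        # Exponential moving average of the episode's tracking error.
        self.errors[segment] = decay * self.errors[segment] + (1 - decay) * tracking_error


def assistive_wrench_gain(success_rate, full_assist_below=0.2, no_assist_above=0.8):
    """Curriculum gain in [0, 1] scaling a model-based assistive wrench on the
    robot's base: full assistance while the policy fails often, fading to zero
    as tracking success improves."""
    span = no_assist_above - full_assist_below
    return float(np.clip((no_assist_above - success_rate) / span, 0.0, 1.0))
```

In such a loop, each episode would draw a segment from the sampler, the measured tracking error would update its estimates, and the assistive wrench would be scaled by the gain computed from a running success rate, so assistance fades as the policy improves.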
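The abstract also mentions selecting joint-level gains from approximate analytical armature values. One standard way to turn a reflected inertia into PD gains, offered here purely as a hedged illustration of that kind of procedure rather than the paper's stated method, is to target a desired closed-loop natural frequency and damping ratio for each joint modeled as a second-order system:

```python
import math

def pd_gains_from_armature(armature, natural_freq_hz=8.0, damping_ratio=1.0):
    """Pick joint PD gains by treating each joint as a second-order system
    dominated by its reflected rotor inertia (armature).

    armature: reflected inertia J_a at the joint [kg*m^2]
    natural_freq_hz: desired closed-loop bandwidth
    damping_ratio: 1.0 -> critically damped
    """
    omega_n = 2.0 * math.pi * natural_freq_hz
    kp = armature * omega_n ** 2                    # stiffness: J_a * w_n^2
    kd = 2.0 * damping_ratio * armature * omega_n   # damping: 2 * zeta * J_a * w_n
    return kp, kd
```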