🤖 AI Summary
This work addresses the challenge of achieving robust, multimodal whole-body control for humanoid robots in real-world environments. We propose the Masked Humanoid Controller (MHC), a learning-based framework that enables stable standing, natural bipedal locomotion, and partial- or full-body motion imitation, all while maintaining balance and resisting external disturbances. Our key contributions include: (i) the first learning-based multimodal whole-body controller demonstrating successful sim-to-real transfer on the Digit V3 robot; (ii) a state-subset masking mechanism for trajectory tracking that supports partial trajectory specification and integrates seamlessly with handheld joystick control; and (iii) a behavior-library-driven progressive curriculum learning strategy that unifies motion-capture data, video-based motion retargeting, optimization-generated trajectories, and human-annotated demonstrations. Extensive simulation and real-robot experiments validate the MHC's ability to execute diverse, user-specified commands reliably and in real time.
📝 Abstract
The foundational capabilities of humanoid robots should include robustly standing, walking, and mimicking whole- and partial-body motions. This work introduces the Masked Humanoid Controller (MHC), which supports all of these capabilities by tracking target trajectories over selected subsets of humanoid state variables while ensuring balance and robustness against disturbances. The MHC is trained in simulation using a carefully designed curriculum that imitates partially masked motions from a library of behaviors spanning standing, walking, optimized reference trajectories, re-targeted video clips, and human motion capture data. It also allows for combining joystick-based control with partial-body motion mimicry. We showcase simulation experiments validating the MHC's ability to execute a wide variety of behaviors from partially specified target motions. Moreover, we demonstrate sim-to-real transfer on the real-world Digit V3 humanoid robot. To our knowledge, this is the first instance of a learned controller that can realize whole-body control of a real-world humanoid for such diverse multimodal targets.
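To make the masking idea concrete, here is a minimal illustrative sketch (not the paper's actual reward; the function name, exponential form, and `scale` parameter are assumptions) of how a tracking objective can be restricted to only the subset of state variables that a target motion specifies, leaving the unmasked dimensions free for the controller to resolve:

```python
import numpy as np

def masked_tracking_reward(state, target, mask, scale=2.0):
    """Illustrative tracking reward over only the masked state dimensions.

    state, target: arrays of humanoid state variables (e.g. joint angles).
    mask: binary array; 1 = dimension is specified by the target motion,
          0 = dimension is unspecified (left to the controller).
    Hypothetical form: exp(-scale * mean squared error over masked dims).
    """
    state, target = np.asarray(state, float), np.asarray(target, float)
    mask = np.asarray(mask, dtype=float)
    if mask.sum() == 0:
        # Nothing is specified, so tracking is trivially satisfied.
        return 1.0
    sq_err = mask * (state - target) ** 2
    return float(np.exp(-scale * sq_err.sum() / mask.sum()))
```

Under this kind of objective, a target that specifies only arm joints (mask = 1 on those dimensions) scores the controller purely on arm tracking, while leg dimensions remain free for balance and locomotion, which is how partial-body mimicry can coexist with joystick-driven walking.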