🤖 AI Summary
This work addresses core challenges in whole-body control of humanoid robots, namely high-dimensional action spaces, the inherent instability of bipedal morphologies, and the difficulty of end-to-end visual learning, by proposing a hierarchical world model architecture that requires no handcrafted rewards, simplifying assumptions, or skill priors. The architecture decouples high-level visual decision-making from low-level motor execution: a high-level agent issues commands from visual observations, and a low-level agent executes them, with both levels jointly optimized via reinforcement learning. Evaluated on a simulated 56-DoF humanoid platform (Isaac Gym), the system generalizes across multiple tasks from raw visual inputs alone, attaining high-performing policies on eight complex tasks whose motion quality human evaluators rate as significantly better than existing baselines. To the authors' knowledge, this is the first demonstration of end-to-end, multi-task, generalizable whole-body control for high-DoF humanoid robots driven solely by vision.
📝 Abstract
Whole-body control for humanoids is challenging due to the high dimensionality of the problem and the inherent instability of a bipedal morphology; learning from visual observations further exacerbates this difficulty. In this work, we explore highly data-driven approaches to visual whole-body humanoid control based on reinforcement learning, without any simplifying assumptions, reward design, or skill primitives. Specifically, we propose a hierarchical world model in which a high-level agent generates commands based on visual observations for a low-level agent to execute, both of which are trained with rewards. Our approach produces highly performant control policies across eight tasks with a simulated 56-DoF humanoid, while synthesizing motions that are broadly preferred by humans.
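To make the decoupling concrete, here is a minimal sketch of the two-level interface described above: a high-level module that maps raw pixels to a low-dimensional command, and a low-level module that maps proprioception plus that command to joint actions. This is not the paper's implementation; all names and dimensions (`HighLevelAgent`, `LowLevelAgent`, `COMMAND_DIM`, `PROPRIO_DIM`, the 64x64 frame size) are illustrative assumptions, and the world-model learning and RL training loops are omitted.

```python
import torch
import torch.nn as nn

# Assumed dimensions for illustration only; the paper's exact
# observation and command sizes may differ.
VISUAL_DIM = (3, 64, 64)   # one RGB frame
PROPRIO_DIM = 120          # joint positions/velocities, etc. (assumed)
COMMAND_DIM = 16           # low-dimensional command vector (assumed)
ACTION_DIM = 56            # one action per DoF of the 56-DoF humanoid

class HighLevelAgent(nn.Module):
    """Maps raw visual observations to a low-dimensional command."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        # Infer the flattened feature size from a dummy forward pass.
        with torch.no_grad():
            feat = self.encoder(torch.zeros(1, *VISUAL_DIM)).shape[1]
        self.head = nn.Linear(feat, COMMAND_DIM)

    def forward(self, rgb: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.head(self.encoder(rgb)))

class LowLevelAgent(nn.Module):
    """Maps proprioception plus the high-level command to joint actions."""
    def __init__(self):
        super().__init__()
        self.policy = nn.Sequential(
            nn.Linear(PROPRIO_DIM + COMMAND_DIM, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, ACTION_DIM), nn.Tanh(),
        )

    def forward(self, proprio: torch.Tensor, command: torch.Tensor) -> torch.Tensor:
        return self.policy(torch.cat([proprio, command], dim=-1))

# One control step of the hierarchy: vision -> command -> motor action.
high, low = HighLevelAgent(), LowLevelAgent()
rgb = torch.zeros(1, *VISUAL_DIM)
proprio = torch.zeros(1, PROPRIO_DIM)
action = low(proprio, high(rgb))
print(action.shape)  # torch.Size([1, 56])
```

The key design point the sketch illustrates is the narrow interface between the levels: the visual agent communicates only through a low-dimensional command, leaving the full 56-dimensional actuation problem to the low-level agent.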