🤖 AI Summary
Problem: Existing approaches to humanoid loco-manipulation in unstructured environments either depend on external motion-capture systems or fail to generalize across diverse tasks.
Method: The paper proposes VisualMimic, a hierarchical whole-body control framework driven by egocentric vision, and builds a unified sim-to-real visuomotor pipeline: (i) a task-agnostic low-level keypoint tracker, trained from human motion data via teacher-student distillation, tracks commanded keypoints; (ii) a task-specific high-level policy generates keypoint commands from visual and proprioceptive input; and (iii) noise injected into the low-level policy, together with clipping of high-level actions using human motion statistics, stabilizes training.
Contribution/Results: Visuomotor policies trained in simulation transfer zero-shot to a real humanoid robot, without external motion capture or fine-tuning, and complete loco-manipulation tasks including box lifting, pushing, football dribbling, and kicking. The policies also generalize to outdoor environments beyond the controlled laboratory setting.
📝 Abstract
Humanoid loco-manipulation in unstructured environments demands tight integration of egocentric perception and whole-body control. However, existing approaches either depend on external motion capture systems or fail to generalize across diverse tasks. We introduce VisualMimic, a visual sim-to-real framework that unifies egocentric vision with hierarchical whole-body control for humanoid robots. VisualMimic combines a task-agnostic low-level keypoint tracker -- trained from human motion data via a teacher-student scheme -- with a task-specific high-level policy that generates keypoint commands from visual and proprioceptive input. To ensure stable training, we inject noise into the low-level policy and clip high-level actions using human motion statistics. VisualMimic enables zero-shot transfer of visuomotor policies trained in simulation to real humanoid robots, accomplishing a wide range of loco-manipulation tasks such as box lifting, pushing, football dribbling, and kicking. Beyond controlled laboratory settings, our policies also generalize robustly to outdoor environments. Videos are available at: https://visualmimic.github.io .
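The two training stabilizers mentioned in the abstract can be illustrated with a minimal sketch. Nothing below is taken from the VisualMimic code: the function names, the array layout (frames x keypoints x 3), the mean plus or minus three standard deviations clipping rule, and the Gaussian noise model are all assumptions chosen for illustration. In particular, the abstract does not say where noise enters the low-level policy; here it is assumed to perturb the keypoint commands the tracker receives.

```python
import numpy as np


def clip_bounds_from_motion(keypoints, num_std=3.0):
    """Derive per-keypoint clipping bounds from human motion data.

    keypoints: array of shape (frames, num_keypoints, 3), a hypothetical
    layout for keypoint positions extracted from a motion dataset.
    Returns (low, high), each of shape (num_keypoints, 3).
    """
    mean = keypoints.mean(axis=0)
    std = keypoints.std(axis=0)
    # Assumed rule: keep commands within num_std standard deviations.
    return mean - num_std * std, mean + num_std * std


def clip_high_level_action(action, low, high):
    """Clip a high-level keypoint command to the human-motion range."""
    return np.clip(action, low, high)


def add_command_noise(command, rng, noise_std=0.02):
    """Perturb a keypoint command with Gaussian noise (assumed noise model)."""
    return command + rng.normal(0.0, noise_std, size=command.shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in for human motion data: 1000 frames, 5 keypoints.
    motion = rng.normal(0.0, 0.1, size=(1000, 5, 3))
    low, high = clip_bounds_from_motion(motion)

    raw_action = rng.normal(0.0, 1.0, size=(5, 3))  # unconstrained policy output
    command = clip_high_level_action(raw_action, low, high)
    noisy_command = add_command_noise(command, rng)
    print(command.shape, noisy_command.shape)
```

The intent of the clipping step in this sketch is that the high-level policy can only request keypoint targets resembling those seen in human motion, which keeps the low-level tracker within the range it was trained on.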