VisualMimic: Visual Humanoid Loco-Manipulation via Motion Tracking and Generation

📅 2025-09-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Humanoid robots generalize poorly in mobile manipulation tasks within unstructured environments and rely heavily on external motion-capture systems. Method: This paper proposes an egocentric vision-based hierarchical whole-body control framework with a unified sim-to-real visuomotor pipeline: (i) a task-agnostic low-level keypoint tracker, trained via teacher-student distillation, follows keypoint targets from proprioception; (ii) a task-specific high-level policy fuses visual and proprioceptive inputs to generate keypoint commands; and (iii) noise injection and action clipping based on human motion statistics ensure stable training. Contribution/Results: The framework enables zero-shot transfer of simulation-trained visuomotor policies to real hardware without external motion capture. Deployed directly on a real humanoid robot, it executes complex loco-manipulation tasks, including box lifting, pushing, football dribbling, and kicking, without fine-tuning, and it generalizes robustly to diverse outdoor environments.
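The two-level control flow described above can be sketched as a minimal pipeline. Everything here is a hypothetical stand-in (random linear maps instead of the paper's learned networks, made-up feature and keypoint dimensions); it only illustrates how the high-level policy's keypoint commands are clipped to human-motion bounds before the low-level tracker turns them into joint targets.

```python
import numpy as np

# Illustrative stand-ins for the learned policies; shapes are invented.
rng = np.random.default_rng(0)
W_HIGH = rng.standard_normal((64 + 32, 9)) * 0.01   # high-level policy (visual + proprio -> keypoints)
W_LOW = rng.standard_normal((32 + 9, 23)) * 0.01    # low-level tracker (proprio + keypoints -> joints)

# Hypothetical per-dimension keypoint bounds derived from human motion statistics.
KP_MIN, KP_MAX = -0.5 * np.ones(9), 0.5 * np.ones(9)

def high_level_policy(visual_feat, proprio):
    """Map egocentric visual features + proprioception to keypoint commands."""
    raw = np.concatenate([visual_feat, proprio]) @ W_HIGH
    # Clip commands to ranges observed in human motion data for stability.
    return np.clip(raw, KP_MIN, KP_MAX)

def low_level_tracker(proprio, keypoint_cmd):
    """Task-agnostic tracker: turn keypoint commands into joint targets."""
    return np.concatenate([proprio, keypoint_cmd]) @ W_LOW

visual_feat = rng.standard_normal(64)   # e.g. an image-encoder output
proprio = rng.standard_normal(32)       # e.g. joint positions/velocities
kp = high_level_policy(visual_feat, proprio)
joints = low_level_tracker(proprio, kp)
```

The key design point this sketch captures is the interface: the high level never emits joint torques directly, only bounded keypoint commands, so the same low-level tracker can serve many tasks.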

📝 Abstract
Humanoid loco-manipulation in unstructured environments demands tight integration of egocentric perception and whole-body control. However, existing approaches either depend on external motion capture systems or fail to generalize across diverse tasks. We introduce VisualMimic, a visual sim-to-real framework that unifies egocentric vision with hierarchical whole-body control for humanoid robots. VisualMimic combines a task-agnostic low-level keypoint tracker -- trained from human motion data via a teacher-student scheme -- with a task-specific high-level policy that generates keypoint commands from visual and proprioceptive input. To ensure stable training, we inject noise into the low-level policy and clip high-level actions using human motion statistics. VisualMimic enables zero-shot transfer of visuomotor policies trained in simulation to real humanoid robots, accomplishing a wide range of loco-manipulation tasks such as box lifting, pushing, football dribbling, and kicking. Beyond controlled laboratory settings, our policies also generalize robustly to outdoor environments. Videos are available at: https://visualmimic.github.io .
Problem

Research questions and friction points this paper is trying to address.

Enabling humanoid robots to perform loco-manipulation in unstructured environments
Overcoming dependence on external motion capture systems for humanoid control
Achieving generalization across diverse manipulation tasks using visual input
Innovation

Methods, ideas, or system contributions that make the work stand out.

Task-agnostic low-level keypoint tracker trained via a teacher-student scheme
Task-specific high-level policy generating keypoint commands from visual and proprioceptive input
Noise injection and human-motion-based action clipping for stable training
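The teacher-student scheme named in the first bullet can be illustrated with a toy regression. This is not the paper's training procedure: the "teacher" here is just a fixed linear map over a privileged state, and the "student", which sees only a deployable subset of that state, is fitted to the teacher's actions by gradient descent on a squared error (a hypothetical, single-sample stand-in for distillation).

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical stand-ins: teacher uses privileged 48-dim state,
# student only the 32 dims observable on the real robot.
W_TEACHER = rng.standard_normal((48, 9))
W_STUDENT = np.zeros((32, 9))  # learned to imitate the teacher

def teacher_action(privileged_state):
    """Privileged teacher policy (fixed for illustration)."""
    return privileged_state @ W_TEACHER

def distill_step(W_student, obs, target, lr=1e-2):
    """One regression step: move student output toward the teacher action."""
    pred = obs @ W_student
    grad = obs[:, None] * (pred - target)[None, :]  # gradient of 0.5*||pred-target||^2
    return W_student - lr * grad

state = rng.standard_normal(48)
obs = state[:32]                  # student observes a subset of the teacher's input
target = teacher_action(state)
for _ in range(500):
    W_STUDENT = distill_step(W_STUDENT, obs, target)
err = np.linalg.norm(obs @ W_STUDENT - target)  # residual imitation error
```

In the actual framework the teacher and student are neural policies trained over many states, but the idea is the same: supervision flows from a policy with privileged information to one restricted to deployable observations.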