RoboMirror: Understand Before You Imitate for Video to Humanoid Locomotion

📅 2025-12-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods for vision–locomotion mapping in humanoid robots suffer from critical limitations: motion-capture- or text-instruction-based approaches yield semantically sparse, fragmented pipelines, while video imitation methods lack visual understanding and merely replicate poses mechanically. This paper introduces the first end-to-end “video-to-gait” framework, abandoning conventional motion retargeting in favor of a novel “understand-then-imitate” paradigm. Key contributions include: (1) leveraging vision-language models (VLMs) to distill high-level visual-motor intent directly from raw first- or third-person demonstration videos; (2) designing a diffusion-based policy network that maps semantic intent to physically feasible gait control; and (3) enabling fully video-conditioned end-to-end training and deployment. Experiments demonstrate an 80% reduction in third-person video control latency, a 3.7% improvement in task success rate, and support for first-person telepresence control.
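
The listing includes no code, but the pipeline described above (raw video features → VLM-style motion-intent embedding → intent-conditioned diffusion policy → joint commands) can be sketched minimally. All names (`VideoIntentEncoder`, `GaitDiffusionPolicy`), dimensions, the action horizon, and the simplified denoising loop below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of an "understand-then-imitate" pipeline:
# a stand-in video encoder distills per-frame features into a motion-intent embedding,
# which conditions a diffusion policy that denoises a joint-command trajectory.
import torch
import torch.nn as nn


class VideoIntentEncoder(nn.Module):
    """Stand-in for the VLM that maps per-frame video features to a motion-intent embedding."""

    def __init__(self, frame_dim=768, intent_dim=256):
        super().__init__()
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=frame_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.proj = nn.Linear(frame_dim, intent_dim)

    def forward(self, frame_feats):           # (B, T, frame_dim) per-frame features
        pooled = self.temporal(frame_feats).mean(dim=1)
        return self.proj(pooled)              # (B, intent_dim) visual motion intent


class GaitDiffusionPolicy(nn.Module):
    """Predicts the noise added to a joint-command trajectory, conditioned on intent."""

    def __init__(self, horizon=16, action_dim=29, intent_dim=256, hidden=512):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        self.net = nn.Sequential(
            nn.Linear(horizon * action_dim + intent_dim + 1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, horizon * action_dim),
        )

    def forward(self, noisy_actions, timestep, intent):
        flat = noisy_actions.flatten(1)                      # (B, horizon * action_dim)
        t = timestep.float().unsqueeze(-1) / 1000.0          # crude timestep embedding
        eps = self.net(torch.cat([flat, intent, t], dim=-1))
        return eps.view(-1, self.horizon, self.action_dim)


@torch.no_grad()
def sample_locomotion(policy, intent, steps=50):
    """Toy reverse-diffusion loop: start from noise, iteratively denoise action chunks."""
    x = torch.randn(intent.shape[0], policy.horizon, policy.action_dim)
    for t in reversed(range(steps)):
        eps = policy(x, torch.full((intent.shape[0],), t), intent)
        x = x - eps / steps                   # simplified update, not a real DDPM schedule
    return x                                  # (B, horizon, action_dim) joint commands


encoder, policy = VideoIntentEncoder(), GaitDiffusionPolicy()
frames = torch.randn(1, 32, 768)              # placeholder per-frame features from a video
actions = sample_locomotion(policy, encoder(frames))
print(actions.shape)                          # torch.Size([1, 16, 29])
```

Because the intent embedding, rather than reconstructed poses, is the only conditioning signal, no retargeting stage appears anywhere in this sketch, which mirrors the paper's stated design goal.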

📝 Abstract
Humans learn locomotion through visual observation, interpreting visual content first before imitating actions. However, state-of-the-art humanoid locomotion systems rely on either curated motion capture trajectories or sparse text commands, leaving a critical gap between visual understanding and control. Text-to-motion methods suffer from semantic sparsity and staged pipeline errors, while video-based approaches only perform mechanical pose mimicry without genuine visual understanding. We propose RoboMirror, the first retargeting-free video-to-locomotion framework embodying "understand before you imitate". Leveraging VLMs, it distills raw egocentric/third-person videos into visual motion intents, which directly condition a diffusion-based policy to generate physically plausible, semantically aligned locomotion without explicit pose reconstruction or retargeting. Extensive experiments validate the effectiveness of RoboMirror: it enables telepresence via egocentric videos, drastically reduces third-person control latency by 80%, and achieves a 3.7% higher task success rate than baselines. By reframing humanoid control around video understanding, we bridge the gap between visual understanding and action.
Problem

Research questions and friction points this paper is trying to address.

Bridging visual understanding and humanoid locomotion control
Overcoming semantic sparsity in text-to-motion methods
Eliminating explicit pose reconstruction and retargeting needs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages VLMs to distill video into motion intents
Uses diffusion-based policy for physically plausible locomotion (training-step sketch after this list)
Eliminates explicit pose reconstruction and retargeting steps
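
To complement the sampling sketch above, a toy training step for such an intent-conditioned diffusion policy could look as follows. The noise-prediction objective and cosine schedule are standard DDPM-style assumptions, not details from the paper; `policy` and `encoder` stand for the hypothetical modules defined in the earlier sketch.

```python
# Toy training step for an intent-conditioned diffusion policy (standard
# noise-prediction objective; schedule and hyperparameters are assumptions).
import torch
import torch.nn.functional as F


def diffusion_training_step(policy, encoder, frame_feats, expert_actions, optimizer, T=1000):
    """Corrupt expert joint trajectories with noise and train the policy to predict that noise."""
    intent = encoder(frame_feats)                                # visual motion intent
    t = torch.randint(0, T, (expert_actions.shape[0],))          # random diffusion step per sample
    alpha_bar = torch.cos(0.5 * torch.pi * t.float() / T) ** 2   # toy cosine noise schedule
    noise = torch.randn_like(expert_actions)
    noisy = (alpha_bar.sqrt().view(-1, 1, 1) * expert_actions
             + (1.0 - alpha_bar).sqrt().view(-1, 1, 1) * noise)
    pred = policy(noisy, t, intent)                              # policy predicts the injected noise
    loss = F.mse_loss(pred, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Example wiring with the hypothetical modules sketched earlier:
# opt = torch.optim.Adam(list(policy.parameters()) + list(encoder.parameters()), lr=1e-4)
# loss = diffusion_training_step(policy, encoder, frame_feats, expert_actions, opt)
```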
👥 Authors
Zhe Li (BAAI)
Cheng Chi (Columbia University, Stanford University): robotics
Yangyang Wei (Harbin Institute of Technology)
Boan Zhu (Hong Kong University of Science and Technology)
Tao Huang (Shanghai Jiao Tong University)
Zhenguo Sun
Yibo Peng (Carnegie Mellon University): Code Generation, Multimodal NLP, AI Agents
Pengwei Wang (University of Calgary): Computer Science, Security
Zhongyuan Wang (University of Sydney)
Fangzhou Liu
Chang Xu (University of Sydney)
Shanghang Zhang (Peking University): Embodied AI, Foundation Models