MimicDreamer: Aligning Human and Robot Demonstrations for Scalable VLA Training

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the visual, viewpoint, and kinematic domain shifts between human demonstration videos and robotic execution, this paper proposes MimicDreamer, a framework that jointly aligns the vision, viewpoint, and action domains. Methodologically: (1) the H2R Aligner, a video diffusion model, synthesizes high-fidelity robot-view videos by transferring motion from human manipulation footage; (2) EgoStabilizer canonicalizes egocentric camera motion via homography and inpaints the occlusions and distortions introduced by warping, yielding a consistent first-person perspective; and (3) hand-trajectory mapping coupled with a constrained inverse kinematics solver ensures precise action transfer. Critically, vision-language-action (VLA) policies trained exclusively on these synthesized human-to-robot videos achieve few-shot execution on physical arms without real robot demonstration data, and scaling training with human data improves the average success rate by 14.7% across six manipulation tasks over models trained solely on real robot data, substantially reducing real-world data collection overhead.
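The action-transfer step above maps tracked hand positions into the robot's coordinate frame before solving for joint commands. The paper does not give the transform's form, but a minimal sketch, assuming a known rigid (rotation + translation) calibration between the stabilized camera frame and the robot base, could look like this; `map_hand_trajectory` and the example transform are illustrative, not from the paper:

```python
import numpy as np

def map_hand_trajectory(points_cam, R, t):
    """Map 3D hand keypoints from the (stabilized) camera frame into the
    robot base frame via a rigid transform: p_robot = R @ p_cam + t.

    points_cam : (N, 3) array of hand positions in camera coordinates
    R          : (3, 3) rotation matrix (camera -> robot base)
    t          : (3,) translation vector
    """
    points_cam = np.asarray(points_cam, dtype=float)
    # Row-vector convention: (p @ R.T) equals (R @ p) applied per point.
    return points_cam @ R.T + t

# Example: a 90-degree yaw between camera and robot base, plus an offset.
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
t = np.array([0.5, 0.0, 0.2])
traj = np.array([[0.1, 0.00, 0.3],
                 [0.1, 0.05, 0.3]])
mapped = map_hand_trajectory(traj, R, t)
```

The mapped waypoints would then be handed to the IK solver as end-effector pose targets.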

📝 Abstract
Vision Language Action (VLA) models derive their generalization capability from diverse training data, yet collecting embodied robot interaction data remains prohibitively expensive. In contrast, human demonstration videos are far more scalable and cost-efficient to collect, and recent studies confirm their effectiveness in training VLA models. However, a significant domain gap persists between human videos and robot-executed videos, including unstable camera viewpoints, visual discrepancies between human hands and robotic arms, and differences in motion dynamics. To bridge this gap, we propose MimicDreamer, a framework that turns fast, low-cost human demonstrations into robot-usable supervision by jointly aligning vision, viewpoint, and actions to directly support policy training. For visual alignment, we propose H2R Aligner, a video diffusion model that generates high-fidelity robot demonstration videos by transferring motion from human manipulation footage. For viewpoint stabilization, EgoStabilizer is proposed, which canonicalizes egocentric videos via homography and inpaints occlusions and distortions caused by warping. For action alignment, we map human hand trajectories to the robot frame and apply a constrained inverse kinematics solver to produce feasible, low-jitter joint commands with accurate pose tracking. Empirically, VLA models trained purely on our synthesized human-to-robot videos achieve few-shot execution on real robots. Moreover, scaling training with human data significantly boosts performance compared to models trained solely on real robot data; our approach improves the average success rate by 14.7% across six representative manipulation tasks.
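EgoStabilizer canonicalizes egocentric frames via homography warping. The abstract does not specify the estimator, but a minimal sketch of the underlying geometry, assuming a plain direct linear transform (DLT) over at least four point correspondences, might be:

```python
import numpy as np

def estimate_homography(src, dst):
    """Estimate a 3x3 homography H with dst ~ H @ src (homogeneous
    coordinates) from >= 4 point correspondences via the DLT."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    A = []
    for (x, y), (u, v) in zip(src, dst):
        # Each correspondence contributes two linear constraints on H.
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    H = Vt[-1].reshape(3, 3)       # null-space vector = flattened H
    return H / H[2, 2]             # fix the projective scale

def warp_points(H, pts):
    """Apply homography H to 2D points (used to warp frames to a
    canonical viewpoint)."""
    pts = np.asarray(pts, dtype=float)
    homo = np.hstack([pts, np.ones((len(pts), 1))])
    out = homo @ H.T
    return out[:, :2] / out[:, 2:3]

# Round-trip check with a synthetic homography and four image corners.
H_true = np.array([[1.00, 0.10,  5.0],
                   [0.05, 1.10, -3.0],
                   [5e-4, 2e-4,  1.0]])
src = np.array([[0, 0], [100, 0], [100, 100], [0, 100]], dtype=float)
dst = warp_points(H_true, src)
H_est = estimate_homography(src, dst)
```

In practice the correspondences would come from feature matching between frames, and the regions exposed or distorted by the warp are then filled by the inpainting stage.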
Problem

Research questions and friction points this paper is trying to address.

Bridging domain gap between human and robot demonstration videos
Converting human videos into robot-usable training supervision
Enabling scalable VLA training with cost-effective human data
Innovation

Methods, ideas, or system contributions that make the work stand out.

H2R Aligner transfers human motion to robot videos
EgoStabilizer stabilizes viewpoints via homography and inpainting
Constrained IK solver maps human trajectories to robot commands
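The bullets above can be made concrete for the action-alignment step. The paper does not disclose its solver, but one common way to get feasible, low-jitter joint commands is damped least-squares IK with a per-step joint-increment clamp; the sketch below assumes a toy planar 2-link arm, with all link lengths and gains illustrative:

```python
import numpy as np

L1, L2 = 0.4, 0.3  # link lengths (m), illustrative

def fk(q):
    """Forward kinematics: joint angles -> end-effector (x, y)."""
    x = L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1])
    y = L1 * np.sin(q[0]) + L2 * np.sin(q[0] + q[1])
    return np.array([x, y])

def jacobian(q):
    s1, c1 = np.sin(q[0]), np.cos(q[0])
    s12, c12 = np.sin(q[0] + q[1]), np.cos(q[0] + q[1])
    return np.array([[-L1 * s1 - L2 * s12, -L2 * s12],
                     [ L1 * c1 + L2 * c12,  L2 * c12]])

def solve_ik(target, q0, damping=1e-2, max_step=0.1, iters=200, tol=1e-5):
    """Damped least-squares IK. The damping term keeps the update stable
    near singularities; clamping each joint increment bounds per-step
    motion, which suppresses jitter in the resulting command stream."""
    q = np.array(q0, dtype=float)
    for _ in range(iters):
        err = target - fk(q)
        if np.linalg.norm(err) < tol:
            break
        J = jacobian(q)
        # dq = J^T (J J^T + lambda^2 I)^{-1} err
        dq = J.T @ np.linalg.solve(J @ J.T + damping**2 * np.eye(2), err)
        dq = np.clip(dq, -max_step, max_step)  # bound joint increments
        q = q + dq
    return q

target = np.array([0.5, 0.2])
q = solve_ik(target, q0=[0.3, 0.3])
```

Running the solver on a sequence of mapped hand waypoints, each warm-started from the previous solution, yields a smooth joint-space trajectory tracking the demonstrated motion.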