ImMimic: Cross-Domain Imitation from Human Videos via Mapping and Interpolation

📅 2025-09-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses cross-domain imitation learning from human videos for robotic manipulation, circumventing the need for costly and scarce real-robot demonstration data. To bridge three key domain gaps (visual representation mismatch, morphological disparity between human hands and robotic arms, and physical dynamics inconsistency), the authors propose a multi-domain co-training framework built on Dynamic Time Warping (DTW) and MixUp-based trajectory interpolation. First, human hand poses are kinematically retargeted into robot-executable trajectories; DTW then temporally aligns the human and robot sequences, and MixUp interpolation over the paired trajectories constructs intermediate domains, enabling joint optimization over a small set of teleoperated demonstrations and large-scale unlabeled human videos without extensive labeled robot data. Evaluated on four real robotic platforms across four manipulation tasks, the method achieves significant improvements in task success rate (+28.6% on average) and motion smoothness, demonstrating strong generalization and effective cross-domain transfer.
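The temporal-alignment stage can be sketched with a textbook DTW implementation. This is a minimal, generic version (not the paper's code): it assumes the retargeted human poses and robot joint trajectories already live in a comparable feature space, and the function and variable names are illustrative.

```python
import numpy as np

def dtw_align(human_traj, robot_traj):
    """Align two trajectories with dynamic time warping.

    human_traj: (T_h, D) array of retargeted human hand poses
    robot_traj: (T_r, D) array of robot joint positions
    Returns the list of (i, j) index pairs on the optimal warping path.
    """
    T_h, T_r = len(human_traj), len(robot_traj)
    cost = np.full((T_h + 1, T_r + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, T_h + 1):
        for j in range(1, T_r + 1):
            d = np.linalg.norm(human_traj[i - 1] - robot_traj[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # step in human time only
                                 cost[i, j - 1],      # step in robot time only
                                 cost[i - 1, j - 1])  # step in both (match)
    # Backtrack from (T_h, T_r) to recover the warping path.
    path, i, j = [], T_h, T_r
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

The returned index pairs give the human-to-robot sample mapping over which interpolation can later be applied; the paper's action- vs. visual-based variants would differ only in the feature space used for the distance `d`.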

📝 Abstract
Learning robot manipulation from abundant human videos offers a scalable alternative to costly robot-specific data collection. However, domain gaps across visual, morphological, and physical aspects hinder direct imitation. To effectively bridge the domain gap, we propose ImMimic, an embodiment-agnostic co-training framework that leverages both human videos and a small amount of teleoperated robot demonstrations. ImMimic uses Dynamic Time Warping (DTW) with either action- or visual-based mapping to map retargeted human hand poses to robot joints, followed by MixUp interpolation between paired human and robot trajectories. Our key insights are (1) retargeted human hand trajectories provide informative action labels, and (2) interpolation over the mapped data creates intermediate domains that facilitate smooth domain adaptation during co-training. Evaluations on four real-world manipulation tasks (Pick and Place, Push, Hammer, Flip) across four robotic embodiments (Robotiq, Fin Ray, Allegro, Ability) show that ImMimic improves task success rates and execution smoothness, highlighting its efficacy in bridging the domain gap for robust robot manipulation. The project website can be found at https://sites.google.com/view/immimic.
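The second stage, MixUp interpolation over already-paired human/robot samples, can be sketched as below. This is a hedged sketch, not the paper's implementation: it assumes the pairing (e.g. from DTW) has already been applied, and it draws a per-pair mixing coefficient from Beta(alpha, alpha) as in standard MixUp; `mixup_pairs` and its arguments are illustrative names.

```python
import numpy as np

def mixup_pairs(human_obs, robot_obs, human_act, robot_act, alpha=0.5, rng=None):
    """MixUp interpolation over aligned human/robot sample pairs.

    All inputs are (N, ...) arrays of already-paired samples. A mixing
    coefficient lambda ~ Beta(alpha, alpha) is drawn per pair, yielding an
    intermediate-domain batch of observations and action labels.
    """
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha, size=(len(human_obs),))
    # Reshape lambda so it broadcasts over trailing feature dimensions.
    lam_obs = lam.reshape((-1,) + (1,) * (human_obs.ndim - 1))
    lam_act = lam.reshape((-1,) + (1,) * (human_act.ndim - 1))
    mixed_obs = lam_obs * human_obs + (1.0 - lam_obs) * robot_obs
    mixed_act = lam_act * human_act + (1.0 - lam_act) * robot_act
    return mixed_obs, mixed_act
```

Because each mixed sample is a convex combination of a human and a robot sample, sweeping lambda from 1 to 0 traces a path of intermediate domains between the two embodiments, which is the mechanism the abstract credits for smooth domain adaptation during co-training.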
Problem

Research questions and friction points this paper is trying to address.

Bridging the visual, morphological, and physical domain gaps between human videos and robot imitation
Mapping retargeted human hand poses to robot joint trajectories
Constructing intermediate domains that enable smoother domain adaptation during co-training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-domain imitation via mapping and interpolation
Embodiment-agnostic co-training with human videos
Dynamic Time Warping with MixUp interpolation