MimicDroid: In-Context Learning for Humanoid Robot Manipulation from Human Play Videos

📅 2025-09-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses few-shot manipulation learning for humanoid robots—specifically, how to achieve in-context learning and cross-environment generalization from unlabeled, continuous human play videos. Methodologically, it introduces the first framework that performs in-context learning using unedited human play videos as the only training data, combining trajectory-pair training with pose retargeting: human wrist poses are estimated from RGB video and mapped to the robot's configuration, while a conditional action-prediction model—trained with random patch masking and behavior-similarity matching of trajectory pairs—bridges the morphological gap between humans and robots. Evaluated on a newly released open-source simulation benchmark, the approach significantly outperforms prior methods; on real-world tasks, it achieves nearly double the success rate, demonstrating strong generalization and practical efficacy.

📝 Abstract
We aim to enable humanoid robots to efficiently solve new manipulation tasks from a few video examples. In-context learning (ICL) is a promising framework for achieving this goal due to its test-time data efficiency and rapid adaptability. However, current ICL methods rely on labor-intensive teleoperated data for training, which restricts scalability. We propose using human play videos -- continuous, unlabeled videos of people interacting freely with their environment -- as a scalable and diverse training data source. We introduce MimicDroid, which enables humanoids to perform ICL using human play videos as the only training data. MimicDroid extracts trajectory pairs with similar manipulation behaviors and trains the policy to predict the actions of one trajectory conditioned on the other. Through this process, the model acquires ICL capabilities for adapting to novel objects and environments at test time. To bridge the embodiment gap, MimicDroid first retargets human wrist poses estimated from RGB videos to the humanoid, leveraging kinematic similarity. It also applies random patch masking during training to reduce overfitting to human-specific cues and improve robustness to visual differences. To evaluate few-shot learning for humanoids, we introduce an open-source simulation benchmark with increasing levels of generalization difficulty. MimicDroid outperformed state-of-the-art methods and achieved nearly twofold higher success rates in the real world. Additional materials can be found at: ut-austin-rpl.github.io/MimicDroid
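The trajectory-pairing objective described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the similarity metric (cosine similarity of mean action vectors) and the feature space are assumptions standing in for whatever matching the authors actually use.

```python
import numpy as np

def pair_similar_trajectories(trajs, top_k=1):
    """Pair each trajectory with its most behaviorally similar peers.

    `trajs` is a list of (T, D) arrays of retargeted wrist-pose actions.
    Similarity here is cosine similarity of mean action vectors -- a
    stand-in for whatever matching metric the paper actually uses.
    """
    feats = np.stack([t.mean(axis=0) for t in trajs])     # (N, D)
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)
    sim = feats @ feats.T                                  # (N, N)
    np.fill_diagonal(sim, -np.inf)                         # exclude self-pairs
    pairs = []
    for i in range(len(trajs)):
        for j in np.argsort(sim[i])[::-1][:top_k]:
            # context trajectory j conditions action prediction on target i
            pairs.append((trajs[j], trajs[i]))
    return pairs

# toy example: two near-identical behaviors and one distinct behavior
a = np.tile([1.0, 0, 0, 0, 0, 0], (10, 1))
b = a + 0.01                        # nearly the same behavior as `a`
c = np.tile([0, 1.0, 0, 0, 0, 0], (10, 1))
trajs = [a, b, c]
pairs = pair_similar_trajectories(trajs)
```

Each target trajectory is then paired with its closest behavioral match as context, and the policy is trained to predict the target's actions conditioned on the context trajectory.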
Problem

Research questions and friction points this paper is trying to address.

Enabling humanoid robots to learn manipulation from few human videos
Overcoming reliance on teleoperated data with human play videos
Bridging embodiment gap between human and robot kinematics
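The embodiment-gap point above corresponds to the wrist-pose retargeting step: a human wrist pose estimated in the camera frame is mapped into the robot's base frame as an end-effector target. A minimal sketch under assumed calibration transforms (the transform names and offsets are hypothetical, not from the paper):

```python
import numpy as np

def retarget_wrist_pose(T_cam_wrist, T_robot_cam, T_wrist_ee):
    """Map a human wrist pose (4x4 homogeneous transform in the camera
    frame) into the robot base frame as an end-effector target.

    The calibration transforms here are assumptions for illustration:
    T_robot_cam places the camera in the robot base frame, and
    T_wrist_ee accounts for the offset between the human wrist and
    the robot end-effector.
    """
    return T_robot_cam @ T_cam_wrist @ T_wrist_ee

def translation(x, y, z):
    """Build a pure-translation homogeneous transform."""
    T = np.eye(4)
    T[:3, 3] = [x, y, z]
    return T

# example: wrist at (0.1, 0.2, 0.3) in a camera mounted 1 m along
# the robot's x-axis, with no wrist-to-end-effector offset
pose = retarget_wrist_pose(translation(0.1, 0.2, 0.3),
                           translation(1.0, 0.0, 0.0),
                           np.eye(4))
```

In practice the rotation blocks would also be nontrivial, and the paper additionally exploits kinematic similarity between the human arm and the humanoid; this sketch shows only the frame composition.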
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses human play videos as the only training data
Retargets estimated human wrist poses to the robot, leveraging kinematic similarity
Applies random patch masking for robustness to visual embodiment differences
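The random patch masking listed above can be illustrated with a short sketch; the patch size and mask ratio are hypothetical defaults, not the paper's settings.

```python
import numpy as np

def random_patch_mask(image, patch=16, mask_ratio=0.3, rng=None):
    """Zero out a random subset of square patches in an (H, W, C) image.

    Masking patches during training discourages the policy from
    latching onto human-specific visual cues (hands, clothing) in the
    play videos. Patch size and ratio here are illustrative choices.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[0] // patch, image.shape[1] // patch
    n_mask = int(round(h * w * mask_ratio))
    idx = rng.choice(h * w, size=n_mask, replace=False)   # distinct patches
    out = image.copy()
    for k in idx:
        r, c = divmod(int(k), w)
        out[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = 0
    return out

img = np.ones((64, 64, 3), dtype=np.float32)
masked = random_patch_mask(img, patch=16, mask_ratio=0.25,
                           rng=np.random.default_rng(0))
```

With a 64x64 image, 16-pixel patches, and a 0.25 ratio, exactly 4 of the 16 patches are zeroed each call; the masked regions vary across training iterations.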