🤖 AI Summary
This work addresses the poor generalization of human-to-robot action transfer caused by the morphological and kinematic disparities between humans and robots. We propose a generative imitation learning method grounded in strictly spatiotemporally aligned human–robot paired videos. To this end, we introduce the first fine-grained aligned Human & Robot (H&R) dataset, comprising 2,600 VR-captured hand–arm grasping video pairs, and formulate imitation learning as an end-to-end diffusion task that jointly generates robot-execution videos and predicts action sequences. Key innovations include a third-person VR capture paradigm, explicit modeling of the hand-to-gripper motion mapping, and a temporal diffusion architecture. Evaluated across eight real-world scenarios, including unseen objects, backgrounds, and novel tasks, our method significantly outperforms baselines: it simultaneously produces high-fidelity robot execution videos and deployable control commands, achieving strong generalization across tasks, objects, and environments.
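To make the joint formulation concrete, below is a minimal, hedged sketch of a diffusion training step that denoises robot-video latents and action sequences together, conditioned on human-video features. It assumes a shared transformer backbone and a standard epsilon-prediction DDPM loss; every module name, tensor shape, and hyperparameter here is an illustrative assumption, not the authors' actual architecture.

```python
# Illustrative sketch only: shapes, modules, and hyperparameters are assumptions,
# not the Human2Robot architecture described in the paper.
import torch
import torch.nn as nn


class JointVideoActionDenoiser(nn.Module):
    """Toy denoiser that jointly predicts noise for robot-video latents and action
    sequences, conditioned on human-demonstration video features."""

    def __init__(self, latent_dim=64, action_dim=7, hidden_dim=256, num_layers=4):
        super().__init__()
        self.video_in = nn.Linear(latent_dim, hidden_dim)
        self.action_in = nn.Linear(action_dim, hidden_dim)
        self.cond_in = nn.Linear(latent_dim, hidden_dim)
        self.time_emb = nn.Sequential(
            nn.Linear(1, hidden_dim), nn.SiLU(), nn.Linear(hidden_dim, hidden_dim)
        )
        layer = nn.TransformerEncoderLayer(hidden_dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.video_out = nn.Linear(hidden_dim, latent_dim)
        self.action_out = nn.Linear(hidden_dim, action_dim)

    def forward(self, noisy_video, noisy_actions, human_video, t):
        # noisy_video:   (B, T, latent_dim)  noisy robot-video latents
        # noisy_actions: (B, T, action_dim)  noisy action sequence
        # human_video:   (B, T, latent_dim)  human-video conditioning features
        # t:             (B,)                diffusion timestep per sample
        te = self.time_emb(t.float().unsqueeze(-1)).unsqueeze(1)   # (B, 1, H)
        tokens = torch.cat(
            [
                self.cond_in(human_video),             # conditioning tokens
                self.video_in(noisy_video) + te,       # robot-video tokens
                self.action_in(noisy_actions) + te,    # action tokens
            ],
            dim=1,
        )
        h = self.backbone(tokens)                      # joint temporal attention
        T = noisy_video.shape[1]
        video_h, action_h = h[:, T:2 * T], h[:, 2 * T:]
        return self.video_out(video_h), self.action_out(action_h)


def ddpm_training_step(model, video_latents, actions, human_video, alphas_cumprod):
    """One DDPM-style step: noise both modalities at a random timestep and
    predict the added noise, summing the two reconstruction losses."""
    b = video_latents.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,))
    a = alphas_cumprod[t].view(b, 1, 1)
    eps_v = torch.randn_like(video_latents)
    eps_a = torch.randn_like(actions)
    noisy_video = a.sqrt() * video_latents + (1 - a).sqrt() * eps_v
    noisy_actions = a.sqrt() * actions + (1 - a).sqrt() * eps_a
    pred_v, pred_a = model(noisy_video, noisy_actions, human_video, t)
    return nn.functional.mse_loss(pred_v, eps_v) + nn.functional.mse_loss(pred_a, eps_a)
```

The point of the sketch is the design choice named in the summary: video tokens and action tokens share one temporal backbone, so the action head is trained alongside robot-video generation rather than as a separate policy.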
📝 Abstract
Distilling knowledge from human demonstrations is a promising way for robots to learn and act. Existing work often overlooks the differences between humans and robots, producing unsatisfactory results. In this paper, we study how perfectly aligned human-robot pairs benefit robot learning. Capitalizing on VR-based teleoperation, we introduce H&R, a third-person dataset with 2,600 episodes, each of which captures the fine-grained correspondence between human hands and the robot gripper. Inspired by the recent success of diffusion models, we introduce Human2Robot, an end-to-end diffusion framework that formulates learning from human demonstrations as a generative task. Human2Robot fully exploits the temporal dynamics in human videos to generate robot videos and predict actions at the same time. Through comprehensive evaluations on 8 seen, changed, and unseen tasks in real-world settings, we demonstrate that Human2Robot not only generates high-quality robot videos but also excels at seen tasks and generalizes effortlessly to unseen objects, backgrounds, and even new tasks.