Human2Robot: Learning Robot Actions from Paired Human-Robot Videos

📅 2025-02-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the poor generalization of human-to-robot action transfer caused by morphological and kinematic disparities between humans and robots. The authors propose a generative imitation learning method grounded in strictly spatiotemporally aligned human-robot paired videos. To this end, they introduce the first fine-grained aligned Human & Robot (H&R) dataset, comprising 2,600 VR-captured hand-arm grasping video pairs, and formulate imitation learning as an end-to-end diffusion task that jointly generates robot-execution videos and predicts action sequences. Key innovations include a third-person VR capture paradigm, explicit modeling of the hand-to-gripper motion mapping, and a temporal diffusion architecture. Evaluated across eight real-world scenarios, including unseen objects, backgrounds, and novel tasks, the method significantly outperforms baselines: it simultaneously produces high-fidelity robot-execution videos and deployable control commands, achieving strong generalization across tasks, objects, and environments.
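
To make the joint generation-and-prediction formulation concrete, below is a minimal sketch of a DDPM-style training step that denoises robot-video latents and action sequences together, conditioned on features from the paired human video. All names, shapes, and the simple linear denoiser are illustrative assumptions; this is not the paper's actual temporal diffusion architecture.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of a joint video/action diffusion objective.
# Module names, shapes, and heads are illustrative assumptions.

class JointDenoiser(nn.Module):
    """Predicts noise for robot-video latents and action sequences in one
    pass, conditioned on features from the paired human demonstration."""
    def __init__(self, latent_dim=64, action_dim=7, cond_dim=512):
        super().__init__()
        self.video_head = nn.Linear(latent_dim + cond_dim + 1, latent_dim)
        self.action_head = nn.Linear(action_dim + cond_dim + 1, action_dim)

    def forward(self, noisy_video, noisy_actions, t, cond):
        # Broadcast the diffusion timestep as one extra feature per frame.
        t_feat = t[:, None, None].expand(*noisy_video.shape[:2], 1)
        video_eps = self.video_head(torch.cat([noisy_video, cond, t_feat], -1))
        action_eps = self.action_head(torch.cat([noisy_actions, cond, t_feat], -1))
        return video_eps, action_eps

def diffusion_step(model, human_cond, video_lat, actions, alphas_cumprod):
    """One DDPM-style training step on a human-robot pair.
    video_lat: (B, T, latent_dim), actions: (B, T, action_dim),
    human_cond: (B, T, cond_dim) features of the human video."""
    B = video_lat.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (B,))
    a_bar = alphas_cumprod[t][:, None, None]  # (B, 1, 1)
    eps_v, eps_a = torch.randn_like(video_lat), torch.randn_like(actions)
    noisy_v = a_bar.sqrt() * video_lat + (1 - a_bar).sqrt() * eps_v
    noisy_a = a_bar.sqrt() * actions + (1 - a_bar).sqrt() * eps_a
    pred_v, pred_a = model(noisy_v, noisy_a, t.float(), human_cond)
    # Joint loss: generate the robot video and predict actions simultaneously.
    return nn.functional.mse_loss(pred_v, eps_v) + nn.functional.mse_loss(pred_a, eps_a)
```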

📝 Abstract
Distilling knowledge from human demonstrations is a promising way for robots to learn and act. Existing work often overlooks the differences between humans and robots, producing unsatisfactory results. In this paper, we study how perfectly aligned human-robot pairs benefit robot learning. Capitalizing on VR-based teleoperation, we introduce H&R, a third-person dataset with 2,600 episodes, each of which captures the fine-grained correspondence between human hands and the robot gripper. Inspired by the recent success of diffusion models, we introduce Human2Robot, an end-to-end diffusion framework that formulates learning from human demonstrations as a generative task. Human2Robot fully explores temporal dynamics in human videos to generate robot videos and predict actions at the same time. Through comprehensive evaluations on 8 seen, changed, and unseen tasks in real-world settings, we demonstrate that Human2Robot can not only generate high-quality robot videos but also excel at seen tasks and generalize to unseen objects, backgrounds, and even new tasks effortlessly.
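
To picture what "fine-grained correspondence" means in practice, a strictly aligned episode can be viewed as a set of frame-synchronized streams. The schema below is an assumed illustration of how one H&R episode might be organized; the field names and keypoint/pose conventions are hypothetical, not the released dataset format.

```python
from dataclasses import dataclass
import numpy as np

# Assumed schema for one spatiotemporally aligned H&R episode.
# Field names and conventions are illustrative, not the released format.

@dataclass
class PairedEpisode:
    human_frames: np.ndarray    # (T, H, W, 3) third-person human hand-arm video
    robot_frames: np.ndarray    # (T, H, W, 3) robot execution video, frame-aligned
    hand_poses: np.ndarray      # (T, 21, 3) human hand keypoints per frame
    gripper_states: np.ndarray  # (T, 8) end-effector pose (7) + gripper width (1)

    def __post_init__(self):
        # Strict temporal alignment: every stream shares the same length T.
        lengths = {a.shape[0] for a in (self.human_frames, self.robot_frames,
                                        self.hand_poses, self.gripper_states)}
        assert len(lengths) == 1, "streams must be frame-synchronized"
```
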
Problem

Research questions and friction points this paper is trying to address.

Learning robot actions from human videos
Aligning human-robot pairs for better learning
Generating robot actions via diffusion models
Innovation

Methods, ideas, or system contributions that make the work stand out.

VR-based teleoperation dataset
Diffusion model framework
Temporal dynamics exploration
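
The summary above mentions explicit hand-gripper motion mapping. As a rough illustration of the general idea, one common heuristic maps the thumb-index pinch distance to the gripper aperture and the wrist to the end-effector target; the snippet below sketches that heuristic under assumed keypoint conventions and is not the paper's learned mapping.

```python
import numpy as np

def hand_to_gripper(hand_kpts: np.ndarray, max_width: float = 0.08):
    """Map one frame of hand keypoints to a gripper command.

    Assumes a 21-keypoint layout where index 0 is the wrist,
    4 the thumb tip, and 8 the index fingertip (MediaPipe-style).
    This heuristic is an illustration, not the paper's mapping.
    """
    wrist = hand_kpts[0]
    pinch = np.linalg.norm(hand_kpts[4] - hand_kpts[8])
    # Clip the pinch distance into the gripper's physical range.
    width = float(np.clip(pinch, 0.0, max_width))
    return wrist, width  # target end-effector position, gripper aperture
```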