RwoR: Generating Robot Demonstrations from Human Hand Collection for Policy Learning without Robot

📅 2025-07-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Significant visual and kinematic discrepancies between human hand demonstrations and robotic manipulation, coupled with reliance on specialized teleoperation hardware, hinder scalable and cost-effective demonstration data collection.
Method: We propose an end-to-end gesture-to-gripper motion generation framework. Using a wrist-mounted GoPro fisheye camera, we capture first-person gesture videos and construct a paired human-hand–robot SE(3) action dataset. A spatiotemporally aligned generative model directly maps hand keypoint sequences to robot gripper trajectories, without requiring physical robots during demonstration recording.
Contribution/Results: This work introduces the first calibration-free, teleoperation-free cross-modal motion generation method based solely on monocular fisheye video. Experiments show that policies trained on generated demonstrations achieve performance comparable to those trained on ground-truth demonstrations across diverse dexterous manipulation tasks. Data acquisition efficiency improves by over 5×, substantially enhancing the practicality and scalability of robotic imitation learning.
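To make the described data flow concrete, below is a minimal Python sketch (not the authors' code) of how a hand demonstration might be converted into a robot-style demonstration: the generative model re-renders each first-person hand frame as a gripper-view frame, and the SE(3) wrist poses extracted from the fisheye video are reused as the action labels. `HandToGripperModel`, its `translate` method, and `generate_robot_demo` are hypothetical placeholders.

```python
# Minimal sketch of the hand-to-gripper generation pipeline (assumed structure).
from dataclasses import dataclass
from typing import List
import numpy as np


@dataclass
class Demonstration:
    frames: List[np.ndarray]   # first-person RGB frames, shape (H, W, 3)
    actions: List[np.ndarray]  # per-step SE(3) poses as 4x4 homogeneous matrices


class HandToGripperModel:
    """Placeholder for the learned hand-to-gripper generative model."""

    def translate(self, hand_frame: np.ndarray) -> np.ndarray:
        # The real model would synthesize a gripper-view image of the same scene;
        # here we return the input unchanged to keep the sketch runnable.
        return hand_frame


def generate_robot_demo(hand_demo: Demonstration,
                        model: HandToGripperModel) -> Demonstration:
    """Map a human-hand demonstration to a robot-style demonstration."""
    gripper_frames = [model.translate(f) for f in hand_demo.frames]
    # Actions are reused as-is: they are already expressed as wrist/gripper SE(3) poses.
    return Demonstration(frames=gripper_frames, actions=hand_demo.actions)
```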

📝 Abstract
Recent advancements in imitation learning have shown promising results in robotic manipulation, driven by the availability of high-quality training data. To improve data collection efficiency, some approaches focus on developing specialized teleoperation devices for robot control, while others directly use human hand demonstrations to obtain training data. However, the former requires both a robotic system and a skilled operator, limiting scalability, while the latter faces challenges in aligning the visual gap between human hand demonstrations and the deployed robot observations. To address this, we propose a human hand data collection system combined with our hand-to-gripper generative model, which translates human hand demonstrations into robot gripper demonstrations, effectively bridging the observation gap. Specifically, a GoPro fisheye camera is mounted on the human wrist to capture human hand demonstrations. We then train a generative model on a self-collected dataset of paired human hand and UMI gripper demonstrations, which have been processed using a tailored data pre-processing strategy to ensure alignment in both timestamps and observations. Therefore, given only human hand demonstrations, we are able to automatically extract the corresponding SE(3) actions and integrate them with the high-quality generated robot demonstrations through our generation pipeline for training robotic policy models. In experiments, the robust manipulation performance demonstrates not only the quality of the generated robot demonstrations but also the efficiency and practicality of our data collection method. More demonstrations can be found at: https://rwor.github.io/
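The abstract mentions a tailored pre-processing strategy that aligns the paired human-hand and UMI-gripper recordings in both timestamps and observations. The exact procedure is not given here; the following is a minimal sketch, assuming per-frame timestamps are available for both streams, of one plausible nearest-timestamp pairing step. The function name and the `max_gap_s` threshold are illustrative assumptions.

```python
# Hypothetical nearest-timestamp pairing of hand-view and gripper-view frames.
import numpy as np


def align_by_timestamp(hand_ts: np.ndarray, gripper_ts: np.ndarray,
                       max_gap_s: float = 0.05):
    """Return index pairs (i, j) where hand frame i and gripper frame j are
    closest in time and no more than `max_gap_s` seconds apart."""
    pairs = []
    for i, t in enumerate(hand_ts):
        j = int(np.argmin(np.abs(gripper_ts - t)))
        if abs(gripper_ts[j] - t) <= max_gap_s:
            pairs.append((i, j))
    return pairs


# Usage: a 60 Hz hand video against a ~59.94 Hz gripper video with a small offset.
hand_ts = np.arange(0.0, 2.0, 1 / 60.0)
gripper_ts = np.arange(0.01, 2.0, 1 / 59.94)
print(len(align_by_timestamp(hand_ts, gripper_ts)))
```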
Problem

Research questions and friction points this paper is trying to address.

Bridging the visual gap between human hand and robot demonstrations
Generating robot training data without a physical robot
Aligning human and robot observations for policy learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Human hand data collection with wrist-mounted GoPro
Hand-to-gripper generative model bridges observation gap
Automated SE(3) action extraction from human demonstrations (see the sketch below)
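As a concrete illustration of the SE(3) action extraction listed above, here is a minimal sketch, assuming the wrist pose at each timestep has already been recovered from the fisheye video (e.g. by visual tracking): per-step actions are taken as the relative transform between consecutive poses. The formulation and function names are assumptions, not the authors' exact procedure.

```python
# Hypothetical sketch: relative SE(3) actions from a sequence of wrist poses.
from typing import List
import numpy as np
from scipy.spatial.transform import Rotation as R


def pose_to_matrix(position: np.ndarray, quat_xyzw: np.ndarray) -> np.ndarray:
    """Build a 4x4 homogeneous transform from a position and an (x, y, z, w) quaternion."""
    T = np.eye(4)
    T[:3, :3] = R.from_quat(quat_xyzw).as_matrix()
    T[:3, 3] = position
    return T


def relative_actions(poses: List[np.ndarray]) -> List[np.ndarray]:
    """Per-step action = transform taking the gripper from pose t to pose t+1."""
    return [np.linalg.inv(poses[t]) @ poses[t + 1] for t in range(len(poses) - 1)]
```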
👥 Authors
Liang Heng (CFCS, School of Computer Science, Peking University)
Xiaoqi Li (CFCS, School of Computer Science, Peking University)
Shangqing Mao (CFCS, School of Computer Science, Peking University)
Jiaming Liu (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Ruolin Liu (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Jingli Wei (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Yu-Kai Wang (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Yueru Jia (School of Computer Science, Peking University)
Chenyang Gu (Undergraduate, Peking University)
Rui Zhao (Tencent Robotics X Laboratory)
Shanghang Zhang (Peking University)
Hao Dong (CFCS, School of Computer Science, Peking University)