RwoR: Generating Robot Demonstrations from Human Hand Collection for Policy Learning without Robot

📅 2025-07-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Significant visual and kinematic discrepancies between human hand demonstrations and robotic manipulation, coupled with reliance on specialized teleoperation hardware, hinder scalable and cost-effective demonstration data collection.
Method: We propose an end-to-end gesture-to-gripper motion generation framework. Using a wrist-mounted GoPro fisheye camera, we capture first-person gesture videos and construct a paired human-hand–robot SE(3) action dataset. A spatiotemporally aligned generative model directly maps hand keypoint sequences to robot gripper trajectories, without requiring physical robots during demonstration recording.
Contribution/Results: This work introduces the first calibration-free, teleoperation-free cross-modal motion generation method based solely on monocular fisheye video. Experiments show that policies trained on generated demonstrations achieve performance comparable to those trained on ground-truth demonstrations across diverse dexterous manipulation tasks. Data acquisition efficiency improves by over 5×, substantially enhancing the practicality and scalability of robotic imitation learning.
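To make the described data flow concrete, below is a minimal Python sketch (not the authors' code) of how a hand demonstration might be converted into a robot-style demonstration: the generative model re-renders each first-person hand frame as a gripper-view frame, and the SE(3) wrist poses extracted from the fisheye video are reused as the action labels. `HandToGripperModel`, its `translate` method, and `generate_robot_demo` are hypothetical placeholders.

```python
# Minimal sketch of the hand-to-gripper generation pipeline (assumed structure).
from dataclasses import dataclass
from typing import List
import numpy as np


@dataclass
class Demonstration:
    frames: List[np.ndarray]   # first-person RGB frames, shape (H, W, 3)
    actions: List[np.ndarray]  # per-step SE(3) poses as 4x4 homogeneous matrices


class HandToGripperModel:
    """Placeholder for the learned hand-to-gripper generative model."""

    def translate(self, hand_frame: np.ndarray) -> np.ndarray:
        # The real model would synthesize a gripper-view image of the same scene;
        # here we return the input unchanged to keep the sketch runnable.
        return hand_frame


def generate_robot_demo(hand_demo: Demonstration,
                        model: HandToGripperModel) -> Demonstration:
    """Map a human-hand demonstration to a robot-style demonstration."""
    gripper_frames = [model.translate(f) for f in hand_demo.frames]
    # Actions are reused as-is: they are already expressed as wrist/gripper SE(3) poses.
    return Demonstration(frames=gripper_frames, actions=hand_demo.actions)
```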

📝 Abstract
Recent advancements in imitation learning have shown promising results in robotic manipulation, driven by the availability of high-quality training data. To improve data collection efficiency, some approaches focus on developing specialized teleoperation devices for robot control, while others directly use human hand demonstrations to obtain training data. However, the former requires both a robotic system and a skilled operator, limiting scalability, while the latter faces challenges in aligning the visual gap between human hand demonstrations and the deployed robot observations. To address this, we propose a human hand data collection system combined with our hand-to-gripper generative model, which translates human hand demonstrations into robot gripper demonstrations, effectively bridging the observation gap. Specifically, a GoPro fisheye camera is mounted on the human wrist to capture human hand demonstrations. We then train a generative model on a self-collected dataset of paired human hand and UMI gripper demonstrations, which have been processed using a tailored data pre-processing strategy to ensure alignment in both timestamps and observations. Therefore, given only human hand demonstrations, we are able to automatically extract the corresponding SE(3) actions and integrate them with the high-quality generated robot demonstrations through our generation pipeline for training robotic policy models. In experiments, the robust manipulation performance demonstrates not only the quality of the generated robot demonstrations but also the efficiency and practicality of our data collection method. More demonstrations can be found at: https://rwor.github.io/
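The abstract mentions a tailored pre-processing strategy that aligns the paired human-hand and UMI-gripper recordings in both timestamps and observations. The exact procedure is not given here; the following is a minimal sketch, assuming per-frame timestamps are available for both streams, of one plausible nearest-timestamp pairing step. The function name and the `max_gap_s` threshold are illustrative assumptions.

```python
# Hypothetical nearest-timestamp pairing of hand-view and gripper-view frames.
import numpy as np


def align_by_timestamp(hand_ts: np.ndarray, gripper_ts: np.ndarray,
                       max_gap_s: float = 0.05):
    """Return index pairs (i, j) where hand frame i and gripper frame j are
    closest in time and no more than `max_gap_s` seconds apart."""
    pairs = []
    for i, t in enumerate(hand_ts):
        j = int(np.argmin(np.abs(gripper_ts - t)))
        if abs(gripper_ts[j] - t) <= max_gap_s:
            pairs.append((i, j))
    return pairs


# Usage: a 60 Hz hand video against a ~59.94 Hz gripper video with a small offset.
hand_ts = np.arange(0.0, 2.0, 1 / 60.0)
gripper_ts = np.arange(0.01, 2.0, 1 / 59.94)
print(len(align_by_timestamp(hand_ts, gripper_ts)))
```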
Problem

Research questions and friction points this paper is trying to address.

Bridging the visual gap between human hand and robot demonstrations
Generating robot training data without a physical robot
Aligning human and robot observations for policy learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Human hand data collection with wrist-mounted GoPro
Hand-to-gripper generative model bridges observation gap
Automated SE(3) action extraction from human demonstrations (see the sketch below)
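As a concrete illustration of the SE(3) action extraction listed above, here is a minimal sketch, assuming the wrist pose at each timestep has already been recovered from the fisheye video (e.g. by visual tracking): per-step actions are taken as the relative transform between consecutive poses. The formulation and function names are assumptions, not the authors' exact procedure.

```python
# Hypothetical sketch: relative SE(3) actions from a sequence of wrist poses.
from typing import List
import numpy as np
from scipy.spatial.transform import Rotation as R


def pose_to_matrix(position: np.ndarray, quat_xyzw: np.ndarray) -> np.ndarray:
    """Build a 4x4 homogeneous transform from a position and an (x, y, z, w) quaternion."""
    T = np.eye(4)
    T[:3, :3] = R.from_quat(quat_xyzw).as_matrix()
    T[:3, 3] = position
    return T


def relative_actions(poses: List[np.ndarray]) -> List[np.ndarray]:
    """Per-step action = transform taking the gripper from pose t to pose t+1."""
    return [np.linalg.inv(poses[t]) @ poses[t + 1] for t in range(len(poses) - 1)]
```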
👥 Authors
Liang Heng (CFCS, School of Computer Science, Peking University)
Xiaoqi Li (CFCS, School of Computer Science, Peking University)
Shangqing Mao (CFCS, School of Computer Science, Peking University)
Jiaming Liu (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Ruolin Liu (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Jingli Wei (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Yu-Kai Wang (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Yueru Jia (School of Computer Science, Peking University)
Chenyang Gu (Undergraduate, Peking University)
Rui Zhao (Tencent Robotics X Laboratory)
Shanghang Zhang (Peking University)
Hao Dong (CFCS, School of Computer Science, Peking University)