Gen2Real: Towards Demo-Free Dexterous Manipulation by Harnessing Generated Video

📅 2025-09-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Dexterous manipulation is hindered by the scarcity of human demonstration data. To address this, we propose an end-to-end learning paradigm that requires no real-world human demonstrations: a single generated video clip serves as the sole supervisory signal. Our method reconstructs hand–object trajectories via video-driven pose and depth estimation, then refines them with a Physics-aware Interaction Optimization Model (PIOM) to ensure dynamic plausibility. We further retarget human motions to the robot hand and stabilize policy learning with anchor-based residual PPO, supporting natural-language task specification and sim-to-real transfer. In simulation, the approach achieves a 77.3% grasping success rate; on a physical robot, it produces coherent and generalizable dexterous manipulation. These results validate both the effectiveness of the framework and its practical deployability.

📝 Abstract
Dexterous manipulation remains a challenging robotics problem, largely due to the difficulty of collecting extensive human demonstrations for learning. In this paper, we introduce Gen2Real, which replaces costly human demos with one generated video and drives robot skill from it: it combines demonstration generation that leverages video generation with pose and depth estimation to yield hand-object trajectories, trajectory optimization that uses a Physics-aware Interaction Optimization Model (PIOM) to impose physics consistency, and demonstration learning that retargets human motions to a robot hand and stabilizes control with an anchor-based residual Proximal Policy Optimization (PPO) policy. Using only generated videos, the learned policy achieves a 77.3% success rate on grasping tasks in simulation and demonstrates coherent executions on a real robot. We also conduct ablation studies to validate the contribution of each component and demonstrate the ability to directly specify tasks using natural language, highlighting the flexibility and robustness of Gen2Real in generalizing grasping skills from imagined videos to real-world execution.
Problem

Research questions and friction points this paper is trying to address.

Achieving dexterous manipulation without human demonstrations
Generating realistic hand-object trajectories from videos
Transferring learned skills from simulation to real robots
Innovation

Methods, ideas, or system contributions that make the work stand out.

Video generation replaces human demonstrations
Physics-aware optimization ensures trajectory consistency
Anchor-based PPO policy stabilizes control learning
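The anchor-based residual idea can be illustrated with a minimal sketch: the policy does not output full joint targets, but a bounded correction added to a reference ("anchor") trajectory obtained from the retargeted demonstration. The function name, the `scale` bound, and the use of `tanh` squashing are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def residual_action(anchor_traj, policy_residual, t, scale=0.1):
    """Combine an anchor waypoint with a learned residual correction.

    anchor_traj:      (T, D) array of reference joint targets from the
                      retargeted demonstration.
    policy_residual:  (D,) raw correction predicted by the PPO policy.
    scale:            bound on the correction, keeping the executed
                      motion close to the demonstrated trajectory.
    """
    # Clamp the time index so the policy can keep acting after the
    # demonstration ends (it then tracks the final waypoint).
    anchor = anchor_traj[min(t, len(anchor_traj) - 1)]
    # tanh squashes the residual into [-1, 1] before scaling.
    return anchor + scale * np.tanh(policy_residual)
```

Bounding the residual this way is a common stabilization trick: early in training the near-random policy can only perturb, not destroy, the demonstrated motion.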
Kai Ye
The Chinese University of Hong Kong, Shenzhen
Yuhang Wu
Universitat Pompeu Fabra
Shuyuan Hu
Shenzhen Institute of Artificial Intelligence and Robotics for Society
Junliang Li
The Chinese University of Hong Kong, Shenzhen
Meng Liu
Shenzhen Institute of Artificial Intelligence and Robotics for Society
Yongquan Chen
The Chinese University of Hong Kong, Shenzhen
Rui Huang
The Chinese University of Hong Kong, Shenzhen