🤖 AI Summary
This work addresses dexterous manipulation, which is hindered by high-dimensional action spaces and the scarcity of large-scale training data; collecting such data typically requires specialized sensing hardware, which limits scalability. The authors propose VIDEOMANIP, a framework that learns dexterous manipulation device-free, directly from monocular RGB videos of human manipulation. By jointly optimizing hand pose estimation, object mesh reconstruction, and hand-object contact, the method recovers interaction-centric 4D hand-object trajectories from a single video and retargets them to a robotic hand for policy learning. A demonstration-synthesis strategy then generates diverse training trajectories from that single video, substantially improving generalization without additional robot demonstrations. In simulation, the approach achieves a 70.25% grasping success rate across 20 diverse objects with the Inspire Hand; in the real world, policies on the LEAP Hand attain an average success rate of 62.86% across seven tasks, outperforming retargeting-based methods by 15.87%.
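For intuition, hand-object contact optimization of this kind is commonly posed as pulling intended hand contact points onto the reconstructed object surface while penalizing interpenetration. The formulation below is a generic sketch of such an objective, not necessarily the paper's exact loss:

```latex
% Generic contact-consistency objective (illustrative only; the paper's
% exact loss and notation are not given here). x_i are intended hand
% contact points, v_j are hand surface vertices, \mathcal{M} is the
% reconstructed object mesh, and d(., \mathcal{M}) is the signed
% distance to it (negative inside the object).
\[
\mathcal{L}_{\mathrm{contact}}
  = \sum_{i \in \mathcal{C}} \min_{p \in \mathcal{M}} \lVert x_i - p \rVert_2
  + \lambda \sum_{j} \max\bigl(0,\, -d(v_j, \mathcal{M})\bigr)
\]
```

The first term rewards contact between designated hand points and the object surface; the second discourages hand geometry from penetrating the mesh, with λ trading off the two.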
📝 Abstract
Multi-finger robotic hand manipulation and grasping are challenging due to the high-dimensional action space and the difficulty of acquiring large-scale training data. Existing approaches largely rely on human teleoperation with wearable devices or specialized sensing equipment to capture hand-object interactions, which limits scalability. In this work, we propose VIDEOMANIP, a device-free framework that learns dexterous manipulation directly from RGB human videos. Leveraging recent advances in computer vision, VIDEOMANIP reconstructs explicit 3D robot-object trajectories from monocular videos by estimating human hand poses and object meshes, and retargets the reconstructed human motions to robotic hands for manipulation learning. To make the reconstructed robot data suitable for dexterous manipulation training, we introduce hand-object contact optimization with interaction-centric grasp modeling, as well as a demonstration synthesis strategy that generates diverse training trajectories from a single video, enabling generalizable policy learning without additional robot demonstrations. In simulation, the learned grasping model achieves a 70.25% success rate across 20 diverse objects using the Inspire Hand. In the real world, manipulation policies trained from RGB videos achieve an average success rate of 62.86% across seven tasks using the LEAP Hand, outperforming retargeting-based methods by 15.87%. Project videos are available at videomanip.github.io.
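To make the demonstration-synthesis idea concrete, the toy sketch below generates many training trajectories from a single reconstructed grasp by randomizing the object's placement and mapping the object-frame trajectory into each new world frame. All names and conventions here are illustrative assumptions, not the authors' released code:

```python
# Toy illustration of synthesizing diverse demonstrations from a single
# reconstructed trajectory (assumed conventions; not the VIDEOMANIP code).
import numpy as np
from scipy.spatial.transform import Rotation as R

def synthesize_demos(traj_obj_frame, num_demos=100, pos_noise=0.05,
                     yaw_range=np.pi, seed=0):
    """traj_obj_frame: (T, 3) hand keypoints expressed in the object frame.
    Returns a list of (object_pose, world_frame_trajectory) pairs."""
    rng = np.random.default_rng(seed)
    demos = []
    for _ in range(num_demos):
        # Sample a random object placement: planar translation plus a yaw.
        t = np.array([*rng.uniform(-pos_noise, pos_noise, size=2), 0.0])
        rot = R.from_euler("z", rng.uniform(-yaw_range, yaw_range))
        # Express the grasp trajectory in the world frame of this placement;
        # the hand-object relation (the grasp itself) is preserved.
        traj_world = rot.apply(traj_obj_frame) + t
        demos.append(((rot.as_matrix(), t), traj_world))
    return demos

# Example: a straight-line approach toward the object origin.
single_traj = np.linspace([0.0, 0.0, 0.30], [0.0, 0.0, 0.05], num=50)
demos = synthesize_demos(single_traj, num_demos=10)
print(len(demos), demos[0][1].shape)  # -> 10 (50, 3)
```

Keeping the trajectory in the object frame is what lets one video cover many scene configurations: each sampled placement yields a new, physically consistent demonstration of the same grasp.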