🤖 AI Summary
This work addresses dexterous manipulation, which is hindered by high-dimensional action spaces and the scarcity of large-scale training data; collecting such data typically requires specialized sensing hardware, which limits scalability. The authors propose VIDEOMANIP, a framework that learns dexterous manipulation device-free, directly from monocular RGB videos of human manipulation. By jointly optimizing hand pose estimation, object mesh reconstruction, and hand-object contact, the method recovers interaction-centric 4D hand-object trajectories from a single video and retargets them to a robotic hand for policy learning. A demonstration-synthesis strategy then generates diverse training trajectories from that single video, substantially improving generalization without additional robot demonstrations. In simulation, the approach achieves a 70.25% grasping success rate across 20 diverse objects with the Inspire Hand; in the real world, policies on the LEAP Hand attain an average success rate of 62.86% across seven tasks, outperforming retargeting-based methods by 15.87%.
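For intuition, hand-object contact optimization of this kind is commonly posed as pulling intended hand contact points onto the reconstructed object surface while penalizing interpenetration. The formulation below is a generic sketch of such an objective, not necessarily the paper's exact loss:

```latex
% Generic contact-consistency objective (illustrative only; the paper's
% exact loss and notation are not given here). x_i are intended hand
% contact points, v_j are hand surface vertices, \mathcal{M} is the
% reconstructed object mesh, and d(., \mathcal{M}) is the signed
% distance to it (negative inside the object).
\[
\mathcal{L}_{\mathrm{contact}}
  = \sum_{i \in \mathcal{C}} \min_{p \in \mathcal{M}} \lVert x_i - p \rVert_2
  + \lambda \sum_{j} \max\bigl(0,\, -d(v_j, \mathcal{M})\bigr)
\]
```

The first term rewards contact between designated hand points and the object surface; the second discourages hand geometry from penetrating the mesh, with λ trading off the two.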
📝 Abstract
Multi-finger robotic hand manipulation and grasping are challenging due to the high-dimensional action space and the difficulty of acquiring large-scale training data. Existing approaches largely rely on human teleoperation with wearable devices or specialized sensing equipment to capture hand-object interactions, which limits scalability. In this work, we propose VIDEOMANIP, a device-free framework that learns dexterous manipulation directly from RGB human videos. Leveraging recent advances in computer vision, VIDEOMANIP reconstructs explicit 3D robot-object trajectories from monocular videos by estimating human hand poses and object meshes, and retargets the reconstructed human motions to robotic hands for manipulation learning. To make the reconstructed robot data suitable for dexterous manipulation training, we introduce hand-object contact optimization with interaction-centric grasp modeling, as well as a demonstration synthesis strategy that generates diverse training trajectories from a single video, enabling generalizable policy learning without additional robot demonstrations. In simulation, the learned grasping model achieves a 70.25% success rate across 20 diverse objects using the Inspire Hand. In the real world, manipulation policies trained from RGB videos achieve an average success rate of 62.86% across seven tasks using the LEAP Hand, outperforming retargeting-based methods by 15.87%. Project videos are available at videomanip.github.io.
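To make the demonstration-synthesis idea concrete, the toy sketch below generates many training trajectories from a single reconstructed grasp by randomizing the object's placement and mapping the object-frame trajectory into each new world frame. All names and conventions here are illustrative assumptions, not the authors' released code:

```python
# Toy illustration of synthesizing diverse demonstrations from a single
# reconstructed trajectory (assumed conventions; not the VIDEOMANIP code).
import numpy as np
from scipy.spatial.transform import Rotation as R

def synthesize_demos(traj_obj_frame, num_demos=100, pos_noise=0.05,
                     yaw_range=np.pi, seed=0):
    """traj_obj_frame: (T, 3) hand keypoints expressed in the object frame.
    Returns a list of (object_pose, world_frame_trajectory) pairs."""
    rng = np.random.default_rng(seed)
    demos = []
    for _ in range(num_demos):
        # Sample a random object placement: planar translation plus a yaw.
        t = np.array([*rng.uniform(-pos_noise, pos_noise, size=2), 0.0])
        rot = R.from_euler("z", rng.uniform(-yaw_range, yaw_range))
        # Express the grasp trajectory in the world frame of this placement;
        # the hand-object relation (the grasp itself) is preserved.
        traj_world = rot.apply(traj_obj_frame) + t
        demos.append(((rot.as_matrix(), t), traj_world))
    return demos

# Example: a straight-line approach toward the object origin.
single_traj = np.linspace([0.0, 0.0, 0.30], [0.0, 0.0, 0.05], num=50)
demos = synthesize_demos(single_traj, num_demos=10)
print(len(demos), demos[0][1].shape)  # -> 10 (50, 3)
```

Keeping the trajectory in the object frame is what lets one video cover many scene configurations: each sampled placement yields a new, physically consistent demonstration of the same grasp.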