AI Summary
This work addresses zero-shot transfer of dexterous bimanual manipulation skills from in-the-wild third-person human demonstration videos to humanoid robots, without camera calibration, depth sensors, 3D object scans, or manual action annotations. Methodologically, it is the first to directly leverage noisy hand-object pose estimates, integrating vision-driven pose estimation, self-supervised motion reconstruction, and a contact-aware reward function to train a general-purpose policy end-to-end in simulation; generalization is further enhanced by fusing real and synthetic video data. The core contribution is a contact-based reinforcement learning reward that eliminates reliance on motion-capture data or fine-grained annotations. On the TACO benchmark, the method improves the ADD-S and VSD metrics by absolute margins of 0.08 and 0.12, respectively; on OakInk-v2, task success rate increases by 19% over the prior state of the art, demonstrating its effectiveness for learning broadly generalizing dexterous manipulation skills.
Abstract
We present DexMan, an automated framework that converts human visual demonstrations into bimanual dexterous manipulation skills for humanoid robots in simulation. Operating directly on third-person videos of humans manipulating rigid objects, DexMan eliminates the need for camera calibration, depth sensors, scanned 3D object assets, or ground-truth hand and object motion annotations. Unlike prior approaches that consider only simplified floating hands, it directly controls a humanoid robot and leverages novel contact-based rewards to improve policy learning from noisy hand-object poses estimated from in-the-wild videos. DexMan achieves state-of-the-art object pose estimation on the TACO benchmark, with absolute gains of 0.08 and 0.12 in ADD-S and VSD, while its reinforcement learning policy surpasses previous methods by 19% in success rate on OakInk-v2. Furthermore, DexMan can generate skills from both real and synthetic videos without manual data collection or costly motion capture, enabling the creation of large-scale, diverse datasets for training generalist dexterous manipulation policies.