AI Summary
This work addresses zero-shot transfer of dexterous bimanual manipulation skills from in-the-wild third-person human demonstration videos to humanoid robots, without camera calibration, depth sensors, 3D object scans, or manual action annotations. Methodologically, it is the first to directly leverage noisy hand-object pose estimates, integrating vision-driven pose estimation, self-supervised motion reconstruction, and a contact-aware reward function to train a general-purpose policy end-to-end in simulation; generalization is further enhanced by fusing real and synthetic video data. The core contribution is a contact-based reinforcement learning reward that eliminates reliance on motion-capture data or fine-grained annotations. On the TACO benchmark, the method improves the ADD-S and VSD metrics by absolute margins of 0.08 and 0.12, respectively; on OakInk-v2, task success rate increases by 19% over the prior state of the art, demonstrating its effectiveness for learning broadly generalizing dexterous manipulation skills.
Abstract
We present DexMan, an automated framework that converts human visual demonstrations into bimanual dexterous manipulation skills for humanoid robots in simulation. Operating directly on third-person videos of humans manipulating rigid objects, DexMan eliminates the need for camera calibration, depth sensors, scanned 3D object assets, or ground-truth hand and object motion annotations. Unlike prior approaches that consider only simplified floating hands, it directly controls a humanoid robot and leverages novel contact-based rewards to improve policy learning from noisy hand-object poses estimated from in-the-wild videos. DexMan achieves state-of-the-art object pose estimation on the TACO benchmark, with absolute gains of 0.08 and 0.12 in ADD-S and VSD, while its reinforcement learning policy surpasses previous methods by 19% in success rate on OakInk-v2. Furthermore, DexMan can generate skills from both real and synthetic videos without manual data collection or costly motion capture, enabling the creation of large-scale, diverse datasets for training generalist dexterous manipulation policies.