World Models Can Leverage Human Videos for Dexterous Manipulation

📅 2025-12-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Dexterous manipulation faces two challenges: hand-object interactions are difficult to model, and high-quality demonstration data is scarce. To address both, we propose DexWM, the first world model designed specifically for dexterous manipulation, which predicts fine-grained hand-object dynamics by forecasting the environment state in latent space. The method has three parts: (1) multi-source self-supervised pretraining on over 900 hours of human and non-dexterous robot video, which alleviates the dexterous data bottleneck; (2) a hand-consistency auxiliary loss that explicitly enforces kinematically plausible hand poses, compensating for the limits of purely vision-based fine-grained pose estimation; (3) hand-motion constraint regularization to improve zero-shot generalization. Evaluated on a Franka Panda arm with an Allegro gripper, DexWM transfers zero-shot to unseen manipulation skills and outperforms Diffusion Policy by over 50% on average in success rate on grasping, placing, and reaching tasks.

📝 Abstract

Dexterous manipulation is challenging because it requires understanding how subtle hand motion influences the environment through contact with objects. We introduce DexWM, a Dexterous Manipulation World Model that predicts the next latent state of the environment conditioned on past states and dexterous actions. To overcome the scarcity of dexterous manipulation datasets, DexWM is trained on over 900 hours of human and non-dexterous robot videos. To enable fine-grained dexterity, we find that predicting visual features alone is insufficient; therefore, we introduce an auxiliary hand consistency loss that enforces accurate hand configurations. DexWM outperforms prior world models conditioned on text, navigation, and full-body actions, achieving more accurate predictions of future states. DexWM also demonstrates strong zero-shot generalization to unseen manipulation skills when deployed on a Franka Panda arm equipped with an Allegro gripper, outperforming Diffusion Policy by over 50% on average in grasping, placing, and reaching tasks.
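
The abstract describes a training objective with two parts: predicting the next latent state, plus an auxiliary hand consistency term. The sketch below illustrates how such a combined objective could be assembled. It is not the paper's implementation: the use of mean squared error for both terms, the `hand_weight` parameter, and the representation of the hand as a flat vector are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(pred) != len(target):
        raise ValueError("vectors must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def world_model_loss(
    pred_latent: Sequence[float],
    target_latent: Sequence[float],
    pred_hand: Sequence[float],
    target_hand: Sequence[float],
    hand_weight: float = 1.0,  # hypothetical weighting, not from the paper
) -> float:
    """Latent prediction error plus a weighted auxiliary hand term.

    Assumption: both terms are plain MSE. The paper may use a different
    distance and a different hand representation.
    """
    latent_term = mse(pred_latent, target_latent)
    hand_term = mse(pred_hand, target_hand)
    return latent_term + hand_weight * hand_term


# Toy values: latent error is (0 + 4) / 2 = 2.0, hand error is 0.25.
loss = world_model_loss([0.0, 1.0], [0.0, 3.0], [0.5], [0.0], hand_weight=2.0)
print(loss)  # 2.0 + 2.0 * 0.25 = 2.5
```

The point of the extra term, as the abstract puts it, is that predicting visual features alone is insufficient for fine-grained dexterity, so the hand configuration receives its own supervised signal.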

Problem

Research questions and friction points this paper is trying to address.

Predicting future states for dexterous manipulation tasks

Overcoming scarcity of dexterous manipulation training data

Enhancing fine-grained hand control in robotic manipulation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Training on human videos for dexterous manipulation

Using auxiliary hand consistency loss for fine-grained control

Zero-shot generalization to unseen skills on robot arm

🔎 Similar Papers

DexSim2Real2: Building Explicit World Model for Precise Articulated Object Dexterous Manipulation

2024-09-13 · arXiv.org · Citations: 0

Vision-based Manipulation from Single Human Video with Open-World Object Graphs

2024-05-30 · arXiv.org · Citations: 40

ViViDex: Learning Vision-based Dexterous Manipulation from Human Videos

2024-04-24 · arXiv.org · Citations: 4