🤖 AI Summary
Addressing the challenges of modeling fine-grained hand-object interactions and weak sim-to-real transfer in dexterous robotic manipulation, this paper proposes learning transferable, fine-grained manipulation priors from large-scale first-person video data. We introduce the first approach to explicitly model hand-object contact points alongside high-fidelity hand pose estimation, establishing a multimodal, vision-driven joint contact-pose representation. Leveraging this prior, we design a reinforcement learning framework that enables efficient policy training and robust cross-domain transfer. Our approach achieves significant success-rate improvements across multiple simulation benchmarks, as well as on newly introduced high-difficulty tasks. Real-world evaluation on a physical dexterous-hand platform demonstrates superior generalization and robustness over state-of-the-art methods. Key contributions are: (1) a novel, transferable paradigm for joint contact-pose representation learning; and (2) a unified evaluation framework bridging simulation pretraining and real-world deployment.
📝 Abstract
Large-scale egocentric video datasets capture diverse human activities across a wide range of scenarios, offering rich and detailed insights into how humans interact with objects, especially in tasks that require fine-grained dexterous control. Such complex dexterous skills with precise control are crucial for many robotic manipulation tasks, yet are often insufficiently addressed by traditional data-driven approaches to robotic manipulation. To address this gap, we leverage manipulation priors learned from large-scale egocentric video datasets to improve policy learning for dexterous robotic manipulation. We present MAPLE, a novel method for dexterous robotic manipulation that exploits these rich manipulation priors to enable efficient policy learning and better performance on diverse, complex manipulation tasks. Specifically, we predict hand-object contact points and detailed hand poses at the moment of hand-object contact, and use the learned features to train policies for downstream manipulation tasks. Experimental results demonstrate the effectiveness of MAPLE on existing simulation benchmarks, as well as on a newly designed set of challenging simulation tasks that require fine-grained object control and complex dexterous skills. The benefits of MAPLE are further highlighted in real-world experiments with a dexterous robotic hand; such simultaneous evaluation across both simulation and real-world settings has remained underexplored in prior work.