🤖 AI Summary
This work addresses two key challenges in real-world dexterous manipulation: (1) low-quality single-view point clouds due to limited sensor resolution, occlusion by the dexterous hand, and suboptimal viewing angles; and (2) the absence of contact information and explicit hand-object spatial correspondence in global point cloud representations. To this end, we propose an interaction-aware point cloud representation method: (i) we introduce an object-centric contact map to explicitly encode physical interactions; (ii) we jointly model coordinated hand-arm dynamics; and (iii) we integrate 6D object pose estimation with proprioceptive sensing to enable end-to-end visuomotor policy learning. Evaluated on four real-world dexterous manipulation tasks, our approach achieves a mean success rate of 90%, significantly outperforming all baselines. It demonstrates strong generalization and robustness across multi-object setups, varying viewpoints, and complex scenes.
📝 Abstract
Achieving human-level dexterity in robots is a key objective in the field of robotic manipulation. Recent advancements in 3D-based imitation learning have shown promising results, providing an effective pathway to achieve this goal. However, obtaining high-quality 3D representations presents two key problems: (1) the quality of point clouds captured by a single-view camera is significantly affected by factors such as camera resolution, positioning, and occlusions caused by the dexterous hand; (2) global point clouds lack crucial contact information and spatial correspondences, which are necessary for fine-grained dexterous manipulation tasks. To address these limitations, we propose CordViP, a novel framework that constructs and learns correspondences by leveraging the robust 6D pose estimation of objects and robot proprioception. Specifically, we first introduce interaction-aware point clouds, which establish correspondences between the object and the hand. These point clouds are then used for our pre-training policy, where we also incorporate object-centric contact maps and hand-arm coordination information, effectively capturing both spatial and temporal dynamics. Our method demonstrates exceptional dexterous manipulation capabilities with an average success rate of 90% in four real-world tasks, surpassing other baselines by a large margin. Experimental results also highlight the superior generalization and robustness of CordViP to different objects, viewpoints, and scenarios. Code and videos are available at https://aureleopku.github.io/CordViP.
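To make the core idea concrete, here is a minimal sketch of how an interaction-aware point cloud with a contact channel might be assembled from an estimated 6D object pose and hand surface points obtained via proprioception and forward kinematics. This is an illustrative assumption, not the paper's actual implementation: the function name, the 5-channel layout (xyz, source label, contact flag), and the distance threshold are all hypothetical.

```python
import numpy as np

def interaction_aware_points(obj_pts_canonical, obj_pose, hand_pts,
                             contact_thresh=0.01):
    """Hypothetical sketch: fuse object and hand points with contact flags.

    obj_pts_canonical : (N, 3) object points in the object's canonical frame
    obj_pose          : (4, 4) estimated 6D pose (object-to-world transform)
    hand_pts          : (M, 3) hand surface points from proprioception/FK
    Returns an (N+M, 5) array: xyz, source label (0=object, 1=hand),
    and a binary contact flag per point.
    """
    # Transform canonical object points into the world frame via the 6D pose.
    obj_pts = obj_pts_canonical @ obj_pose[:3, :3].T + obj_pose[:3, 3]

    # Pairwise distances between object and hand points; a point is "in
    # contact" if its nearest counterpart is within contact_thresh meters.
    d = np.linalg.norm(obj_pts[:, None, :] - hand_pts[None, :, :], axis=-1)
    obj_contact = (d.min(axis=1) < contact_thresh).astype(np.float32)
    hand_contact = (d.min(axis=0) < contact_thresh).astype(np.float32)

    # Stack per-point features: position, source label, contact flag.
    obj_feat = np.concatenate(
        [obj_pts, np.zeros((len(obj_pts), 1)), obj_contact[:, None]], axis=1)
    hand_feat = np.concatenate(
        [hand_pts, np.ones((len(hand_pts), 1)), hand_contact[:, None]], axis=1)
    return np.concatenate([obj_feat, hand_feat], axis=0)
```

Because both point sets live in a common frame derived from pose estimation and proprioception rather than raw depth, this representation is unaffected by hand-induced occlusion in the camera view, which is the motivation the abstract gives for the design.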