🤖 AI Summary
Dexterous manipulation typically relies on a single wrist-mounted camera view, which is often occluded and limits performance on tasks that require multi-view perception. The authors present FingerViP, a dexterous hand system with a miniature camera embedded in a fingertip module on each finger. A diffusion-based whole-body visuomotor policy, conditioned on a third-view camera and the multi-view fingertip images, learns manipulation skills directly from human demonstrations. Each fingertip visual feature is augmented with its camera pose encoding and a per-finger joint-current encoding to improve view-proprioception alignment and contact awareness. On a set of challenging real-world tasks (pressing buttons inside a confined box, retrieving sticks from an unstable support, retrieving objects behind an occluding curtain, and long-horizon cabinet opening with object retrieval), the method achieves an overall success rate of 80.8%, demonstrating robustness and adaptability.
📝 Abstract
The current practice of dexterous manipulation generally relies on a single wrist-mounted view, which is often occluded and limits performance on tasks requiring multi-view perception. In this work, we present FingerViP, a learning system that utilizes a visuomotor policy with fingertip visual perception for dexterous manipulation. Specifically, we design a vision-enhanced fingertip module with an embedded miniature camera and install the modules on each finger of a multi-fingered hand. The fingertip cameras substantially improve visual perception by providing comprehensive, multi-view feedback of both the hand and its surrounding environment. Building on the integrated fingertip modules, we develop a diffusion-based whole-body visuomotor policy conditioned on a third-view camera and multi-view fingertip vision, which effectively learns complex manipulation skills directly from human demonstrations. To improve view-proprioception alignment and contact awareness, each fingertip visual feature is augmented with its corresponding camera pose encoding and per-finger joint-current encoding. We validate the effectiveness of the multi-view fingertip vision and demonstrate the robustness and adaptability of FingerViP on various challenging real-world tasks, including pressing buttons inside a confined box, retrieving sticks from an unstable support, retrieving objects behind an occluding curtain, and performing long-horizon cabinet opening and object retrieval, achieving an overall success rate of 80.8%. All hardware designs and code will be fully open-sourced.
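The feature augmentation described above can be illustrated with a minimal sketch. The abstract does not specify how the encodings are computed or fused, so everything here is an assumption made for illustration: the function names, the sinusoidal encoding, fusion by concatenation, five fingers, a 7-D camera pose (position plus quaternion), and three joint currents per finger are all hypothetical and may differ from the authors' implementation.

```python
import numpy as np


def sinusoidal_encoding(x, num_freqs=4):
    """Encode each scalar with sin/cos at frequencies 1, 2, 4, ... (assumed scheme)."""
    x = np.asarray(x, dtype=np.float64)
    freqs = 2.0 ** np.arange(num_freqs)              # (num_freqs,)
    angles = x[..., None] * freqs                    # (..., D, num_freqs)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*x.shape[:-1], -1)            # (..., D * 2 * num_freqs)


def build_fingertip_tokens(visual_feats, cam_poses, joint_currents, num_freqs=4):
    """Augment each finger's visual feature with pose and joint-current encodings.

    visual_feats:   (F, C) one visual feature vector per fingertip camera
    cam_poses:      (F, 7) hypothetical pose layout: xyz position + quaternion
    joint_currents: (F, J) motor currents of the joints of each finger
    Returns (F, C + 7*2*num_freqs + J*2*num_freqs) conditioning tokens.
    """
    pose_enc = sinusoidal_encoding(cam_poses, num_freqs)
    current_enc = sinusoidal_encoding(joint_currents, num_freqs)
    # Fusion by concatenation is an assumption; addition after a learned
    # projection would be another plausible choice.
    return np.concatenate([visual_feats, pose_enc, current_enc], axis=-1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_fingers, feat_dim, joints_per_finger = 5, 32, 3   # illustrative sizes
    tokens = build_fingertip_tokens(
        rng.normal(size=(num_fingers, feat_dim)),
        rng.normal(size=(num_fingers, 7)),
        rng.normal(size=(num_fingers, joints_per_finger)),
    )
    print(tokens.shape)
```

In a full system, tokens like these would be combined with the third-view camera features to condition the diffusion policy. That step is omitted here because the abstract gives no detail on it.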