🤖 AI Summary
This work addresses the challenge of direct tactile-proprioceptive-to-action mapping in dexterous multi-fingered robotic grasping, aiming for robust and generalizable manipulation of both rigid and deformable objects. We propose a unified graph-structured, polar-coordinate multimodal representation that explicitly encodes hand morphology differences across platforms. To realize cross-platform perception-action mapping, we design the Tactile-Kinesthetic Spatio-Temporal Graph Network (TK-STGN), which integrates multidimensional subgraph convolutions with attention-based LSTM layers. Leveraging human hand demonstrations collected via data gloves, our approach combines imitation learning with hybrid force-position control. Extensive experiments on multiple robotic platforms demonstrate significant improvements in grasping success rates for unseen and deformable objects. To the best of our knowledge, this is the first framework enabling proprioceptively grounded, multimodal perception-action transfer and generalization across diverse robotic hands.
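As a rough illustration (not the paper's actual code), the sketch below shows one way such a graph-structured, polar-coordinate multimodal representation could be assembled: each joint becomes a node carrying polar-coordinate kinematics plus a tactile reading, and edges follow the kinematic chain so that hand morphology is encoded explicitly. All names (`build_hand_graph`, the feature layout, the toy values) are hypothetical.

```python
# Hypothetical sketch: per-joint nodes carrying tactile + polar-coordinate
# kinematic features, with edges given by the hand's kinematic chain.
# Names and the exact feature layout are assumptions, not the paper's API.
import numpy as np

def build_hand_graph(joint_angles, joint_radii, tactile, parent):
    """joint_angles, joint_radii: per-joint polar coordinates (rad, m).
    tactile: per-joint normal-force readings from the data glove.
    parent: parent index of each joint (-1 for the root), which
            encodes the hand morphology as a kinematic tree."""
    n = len(parent)
    # Node features: [theta, r, tactile] per joint -> shape (n, 3).
    nodes = np.stack([joint_angles, joint_radii, tactile], axis=-1)
    # Adjacency from the kinematic chain (undirected, with self-loops).
    adj = np.eye(n)
    for child, par in enumerate(parent):
        if par >= 0:
            adj[child, par] = adj[par, child] = 1.0
    return nodes, adj

# Toy example: a single 4-joint finger chain rooted at the wrist (joint 0).
nodes, adj = build_hand_graph(
    joint_angles=np.array([0.0, 0.3, 0.6, 0.8]),
    joint_radii=np.array([0.00, 0.04, 0.03, 0.02]),
    tactile=np.array([0.0, 0.1, 0.5, 1.2]),
    parent=[-1, 0, 1, 2],
)
```

Because human demonstrations and different robotic hands are expressed as graphs of the same form, the same downstream network can consume either, which is what enables the cross-platform transfer.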
📝 Abstract
Tactile and kinesthetic perceptions are crucial for human dexterous manipulation, enabling reliable grasping of objects via proprioceptive sensorimotor integration. For robotic hands, even though acquiring such tactile and kinesthetic feedback is feasible, establishing a direct mapping from this sensory feedback to motor actions remains challenging. In this paper, we propose a novel glove-mediated tactile-kinesthetic perception-prediction framework that transfers grasp skills from intuitive, natural human operation to robotic execution via imitation learning; its effectiveness is validated on generalized grasping tasks, including those involving deformable objects. Firstly, we integrate a data glove to capture tactile and kinesthetic data at the joint level. The glove is adaptable to both human and robotic hands, allowing data collection from natural human hand demonstrations across different scenarios and ensuring a consistent raw data format, so that grasping can be evaluated on both human and robotic hands. Secondly, we establish a unified representation of multimodal inputs based on graph structures with polar coordinates. We explicitly encode morphological differences in this representation, enhancing compatibility across different demonstrators and robotic hands. Furthermore, we introduce the Tactile-Kinesthetic Spatio-Temporal Graph Networks (TK-STGN), which leverage multidimensional subgraph convolutions and attention-based LSTM layers to extract spatio-temporal features from the graph inputs and predict node-based states for each hand joint. These predictions are then mapped to final commands through a hybrid force-position mapping.
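To make the prediction pipeline concrete, here is a minimal, hypothetical PyTorch sketch of a TK-STGN-style forward pass: a per-frame aggregation over the hand graph (the multidimensional subgraph convolutions are simplified here to a single mean aggregation), an attention-weighted LSTM over time, per-node heads for desired position and force, and a per-joint blending weight standing in for the hybrid force-position mapping. The layer sizes, attention form, and blending rule are assumptions, not the paper's exact architecture.

```python
# Hypothetical TK-STGN-style sketch; layer sizes, the attention mechanism,
# and the hybrid force-position blend are assumptions, not the paper's design.
import torch
import torch.nn as nn

class TKSTGNSketch(nn.Module):
    def __init__(self, in_dim=3, hid=64):
        super().__init__()
        self.gcn = nn.Linear(in_dim, hid)      # shared per-node transform
        self.lstm = nn.LSTM(hid, hid, batch_first=True)
        self.attn = nn.Linear(hid, 1)          # temporal attention scores
        self.head_pos = nn.Linear(hid, 1)      # per-joint position target
        self.head_force = nn.Linear(hid, 1)    # per-joint force target
        self.head_alpha = nn.Linear(hid, 1)    # per-joint blending weight

    def forward(self, x, adj):
        # x: (B, T, N, F) node features over time; adj: (N, N) hand graph.
        B, T, N, _ = x.shape
        deg = adj.sum(-1, keepdim=True).clamp(min=1.0)
        agg = (adj / deg) @ x                     # mean over neighbours (incl. self)
        h = torch.relu(self.gcn(agg))             # (B, T, N, hid)
        h = h.permute(0, 2, 1, 3).reshape(B * N, T, -1)
        out, _ = self.lstm(h)                     # temporal encoding per node
        w = torch.softmax(self.attn(out), dim=1)  # attention over time steps
        ctx = (w * out).sum(dim=1).view(B, N, -1)
        pos = self.head_pos(ctx)                  # desired joint position
        force = self.head_force(ctx)              # desired contact force
        alpha = torch.sigmoid(self.head_alpha(ctx))
        return pos, force, alpha

# Toy usage: 2 sequences, 10 frames, 16 joints, 3 features per joint.
model = TKSTGNSketch()
pos, force, alpha = model(torch.randn(2, 10, 16, 3), torch.eye(16))
# A hybrid controller could use alpha per joint to weight position tracking
# against force tracking when producing the final joint command.
```

The sketch is meant to convey the two-stage factorization: spatial aggregation over the hand graph captures how neighbouring joints and contacts interact, while the attention-weighted recurrence captures how those interactions evolve over the course of the grasp.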