🤖 AI Summary
This work addresses the challenges in dexterous grasping where dense pose or contact targets are difficult to specify and end-to-end reinforcement learning suffers from poor controllability. To overcome these limitations, the authors propose GRIT, a two-stage framework that first predicts sparse, taxonomy-based grasp instructions grounded in scene and task context, then conditions continuous multi-finger motion generation on these high-level directives to jointly optimize task completion and grasp structural stability. GRIT is the first approach to integrate high-level grasp taxonomies with dexterous control, enabling semantic user intervention and significantly improving generalization to novel objects and system controllability. Experiments demonstrate that GRIT achieves an overall success rate of 87.9% in simulation, outperforming existing baselines, and validate its policy adaptability on a physical robot.
📝 Abstract
Dexterous manipulation requires planning a grasp configuration suited to the object and task, which is then executed through coordinated multi-finger control. However, specifying grasp plans with dense pose or contact targets for every object and task is impractical. Meanwhile, end-to-end reinforcement learning from task rewards alone lacks controllability, making it difficult for users to intervene when failures occur. To this end, we present GRIT, a two-stage framework that learns dexterous control from sparse taxonomy guidance. GRIT first predicts a taxonomy-based grasp specification from the scene and task context. Conditioned on this sparse command, a policy generates continuous finger motions that accomplish the task while preserving the intended grasp structure. Our result shows that certain grasp taxonomies are more effective for specific object geometries. By leveraging this relationship, GRIT improves generalization to novel objects over baselines and achieves an overall success rate of 87.9%. Moreover, real-world experiments demonstrate controllability, enabling grasp strategies to be adjusted through high-level taxonomy selection based on object geometry and task intent.