AI Summary
Existing dexterous grasping methods rely heavily on large-scale annotated datasets, limiting generalization to unseen objects and diverse task instructions.
Method: We propose the first zero-shot, task-oriented grasping framework requiring no training data. It employs a multimodal large language model (MLLM) with prompt engineering for multi-stage semantic reasoning, precisely aligning task intent with object affordances; it then predicts semantically meaningful contact regions and optimizes dexterous grasp poses under physical constraints.
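The two-stage reasoning described above can be sketched as follows. This is an illustrative, hypothetical sketch, not the paper's implementation: `query_mllm` is a stub standing in for a real multimodal LLM call (which would also receive an object image), and the prompt wording and JSON schema are assumptions.

```python
import json

def query_mllm(prompt: str) -> str:
    # Stub: a real system would send the prompt (plus an object image)
    # to a multimodal LLM and return its free-form text answer.
    return '{"grasp_type": "pinch", "fingers": ["thumb", "index"], "contact_part": "edge"}'

def multi_stage_reasoning(task: str, object_name: str) -> dict:
    # Stage 1: align the task intent with the object's affordances.
    affordance_prompt = (
        f"Task: {task}. Object: {object_name}. "
        "Which part of the object affords this task?"
    )
    # Stage 2: turn the inferred affordance into an initial grasp
    # configuration and contact information, requested as structured JSON.
    grasp_prompt = (
        affordance_prompt
        + " Answer as JSON with keys grasp_type, fingers, contact_part."
    )
    return json.loads(query_mllm(grasp_prompt))

config = multi_stage_reasoning("pinch the edge", "plate")
print(config["grasp_type"], config["contact_part"])
```

In a real system the stub would be replaced by an actual MLLM API call, and the structured answer would seed the contact-guided optimization stage.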
Contribution/Results: This work establishes the first end-to-end integration of MLLMs with contact-aware grasp optimization, significantly enhancing zero-shot generalization across novel objects and complex instructions (e.g., "pinch the edge using thumb and index finger"). Experiments demonstrate high success rates and strong task compliance on diverse unseen objects and intricate manipulation directives, introducing a new paradigm for general-purpose intelligent grasping.
Abstract
Task-oriented dexterous grasping holds broad application prospects in robotic manipulation and human-object interaction. However, most existing methods still struggle to generalize across diverse objects and task instructions, as they rely heavily on costly labeled data to ensure task-specific semantic alignment. In this study, we propose **ZeroDexGrasp**, a zero-shot task-oriented dexterous grasp synthesis framework integrating Multimodal Large Language Models with grasp refinement to generate human-like grasp poses that are well aligned with specific task objectives and object affordances. Specifically, ZeroDexGrasp employs prompt-based multi-stage semantic reasoning to infer initial grasp configurations and object contact information from task and object semantics, then exploits contact-guided grasp optimization to refine these poses for physical feasibility and task alignment. Experimental results demonstrate that ZeroDexGrasp enables high-quality zero-shot dexterous grasping on diverse unseen object categories and complex task requirements, advancing toward more generalizable and intelligent robotic grasping.
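A toy sketch of the contact-guided refinement idea, under stated assumptions: real dexterous grasp optimization operates on hand joint angles with collision and force-closure constraints, whereas this minimal version only pulls fingertip positions toward MLLM-predicted contact points by gradient descent on a squared-distance objective. All names and values here are illustrative.

```python
import numpy as np

def refine_fingertips(fingertips, contacts, lr=0.1, steps=200):
    """Move fingertip positions toward target contact points.

    fingertips, contacts: (F, 3) arrays of 3D positions.
    Minimizes sum ||x_f - c_f||^2 by plain gradient descent; a real
    optimizer would work in joint space under physical constraints.
    """
    x = np.asarray(fingertips, dtype=float).copy()
    c = np.asarray(contacts, dtype=float)
    for _ in range(steps):
        grad = 2.0 * (x - c)   # gradient of the squared-distance objective
        x -= lr * grad
    return x

# Assumed example targets: thumb and index contacts on an object edge.
contacts = np.array([[0.0, 0.1, 0.0], [0.0, -0.1, 0.0]])
init = np.array([[0.05, 0.2, 0.1], [0.02, -0.2, 0.05]])
refined = refine_fingertips(init, contacts)
print(np.abs(refined - contacts).max())
```

The residual shrinks geometrically (each step scales the error by 1 − 2·lr), so the refined fingertips converge onto the predicted contact region.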