🤖 AI Summary
Existing task-oriented grasping (TOG) methods rely on an inefficient two-stage paradigm: exhaustive task-agnostic sampling of 6-DoF gripper poses, followed by filtering with demonstration-induced constraints (e.g., contact regions and wrist orientations), which yields low sampling efficiency and potential failures in the vast sampling space. This work proposes HGDiffuser (Human-guided Grasp Diffuser), a single-stage diffusion framework for TOG that integrates human grasp demonstrations directly into a guided sampling process, enabling direct generation of stable, task-constrained 6-DoF parallel-jaw grasps without exhaustive task-agnostic sampling. Diffusion Transformer (DiT) blocks serve as the feature backbone, improving grasp generation quality over MLP-based alternatives. Experiments demonstrate significant improvements in generation efficiency across multi-object, multi-task scenarios, along with more effective transfer of human grasping strategies to robotic systems, bridging high-level task constraints with low-level physical feasibility in a unified generative framework.
📝 Abstract
Task-oriented grasping (TOG) is essential for robots to perform manipulation tasks, requiring grasps that are both stable and compliant with task-specific constraints. Humans naturally grasp objects in a task-oriented manner to facilitate subsequent manipulation tasks. By leveraging human grasp demonstrations, current methods can generate high-quality robotic parallel-jaw task-oriented grasps for diverse objects and tasks. However, they still encounter challenges in maintaining grasp stability and sampling efficiency. These methods typically rely on a two-stage process: first performing exhaustive task-agnostic grasp sampling in the 6-DoF space, then applying demonstration-induced constraints (e.g., contact regions and wrist orientations) to filter candidates. This leads to inefficiency and potential failure due to the vast sampling space. To address this, we propose the Human-guided Grasp Diffuser (HGDiffuser), a diffusion-based framework that integrates these constraints into a guided sampling process. Through this approach, HGDiffuser directly generates 6-DoF task-oriented grasps in a single stage, eliminating exhaustive task-agnostic sampling. Furthermore, by incorporating Diffusion Transformer (DiT) blocks as the feature backbone, HGDiffuser improves grasp generation quality compared to MLP-based methods. Experimental results demonstrate that our approach significantly improves the efficiency of task-oriented grasp generation, enabling more effective transfer of human grasping strategies to robotic systems. To access the source code and supplementary videos, visit https://sites.google.com/view/hgdiffuser.
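The core idea of the abstract — steering diffusion sampling with demonstration-induced constraints rather than filtering afterwards — can be illustrated with a classifier-guidance-style sketch. This is a minimal toy example, not the authors' implementation: the denoiser is a stand-in for a learned network, and `cost` is a hypothetical constraint (squared distance of a 6-DoF pose vector to a demonstrated pose); its gradient nudges every denoising step toward constraint-satisfying grasps.

```python
import numpy as np

def cost(x, target):
    # Toy demonstration constraint: squared distance to a demonstrated pose.
    return 0.5 * float(np.sum((x - target) ** 2))

def cost_grad(x, target):
    # Gradient of the constraint cost w.r.t. the pose vector.
    return x - target

def guided_reverse_diffusion(x_T, target, steps=50, guidance_scale=0.5, seed=0):
    """Denoise x_T step by step, adding the constraint gradient at every
    step (classifier-guidance style) so samples land near the target."""
    rng = np.random.default_rng(seed)
    x = x_T.astype(float).copy()
    for t in range(steps, 0, -1):
        noise_level = t / steps
        # Stand-in for a learned denoiser: shrink slightly toward the data mean.
        denoised = x * (1.0 - 1.0 / steps)
        # Guidance: step down the constraint cost, stronger at noisier steps.
        denoised -= guidance_scale * noise_level * cost_grad(x, target)
        if t > 1:
            # Stochastic transition except at the final step.
            x = denoised + np.sqrt(noise_level / steps) * rng.standard_normal(x.shape)
        else:
            x = denoised
    return x
```

Running the same chain with `guidance_scale=0.0` recovers unconstrained sampling; the guided chain ends measurably closer to the demonstrated pose, which is exactly the single-stage behavior the paper contrasts with sample-then-filter pipelines.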
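The abstract also credits DiT blocks as the feature backbone. The defining ingredient of a DiT block is adaptive layer norm (adaLN) conditioning: scale, shift, and gate parameters are regressed from a conditioning vector (e.g., a timestep embedding) and modulate the attention and MLP branches. The single-head numpy sketch below shows that structure only; the real paper's architecture, dimensions, and conditioning details may differ, and all weight names here are illustrative.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature dimension.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

class DiTBlock:
    """Single-head DiT-style block: adaLN modulation (scale/shift/gate
    regressed from a conditioning vector) around attention and an MLP."""

    def __init__(self, d, rng):
        self.Wq, self.Wk, self.Wv, self.Wo = (
            rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4)
        )
        self.W1 = rng.standard_normal((d, 4 * d)) / np.sqrt(d)
        self.W2 = rng.standard_normal((4 * d, d)) / np.sqrt(4 * d)
        # Regresses 6 modulation vectors (2 scales, 2 shifts, 2 gates) from c.
        self.Wc = rng.standard_normal((d, 6 * d)) / np.sqrt(d)

    def __call__(self, x, c):
        # x: (tokens, d) token features; c: (d,) conditioning vector.
        s1, b1, g1, s2, b2, g2 = np.split(c @ self.Wc, 6)
        # Attention branch with adaLN modulation and gated residual.
        h = layer_norm(x) * (1 + s1) + b1
        q, k, v = h @ self.Wq, h @ self.Wk, h @ self.Wv
        attn = softmax(q @ k.T / np.sqrt(x.shape[-1])) @ v @ self.Wo
        x = x + g1 * attn
        # MLP branch (ReLU here; GELU in practice), same modulation pattern.
        h = layer_norm(x) * (1 + s2) + b2
        return x + g2 * (np.maximum(h @ self.W1, 0.0) @ self.W2)
```

The conditioning path is what distinguishes this from a plain transformer block, and it is one plausible reason DiT backbones model the denoising target better than an MLP that merely concatenates the timestep embedding to its input.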