🤖 AI Summary
Problem: Existing LLM-based grasping methods lack explicit modeling of objects' physical properties, leading to poor robustness in natural-language-driven 6-DoF grasping. Method: We propose a physics-aware chain-of-thought (CoT) framework that decomposes the task into three sequential reasoning stages: target parsing, physical property analysis, and grasp action selection. To support evaluation, we introduce IntentGrasp, the first benchmark covering multi-object scenes with indirect, physics-referential instructions. Our method employs a unified multimodal architecture that couples multi-view 3D visual encoders with an LLM, uses auxiliary question-answering tasks to explicitly foster physical understanding, and jointly embeds 3D-aware visual tokens with textual tokens. Contribution/Results: Our approach significantly outperforms state-of-the-art methods on IntentGrasp. Real-robot experiments demonstrate strong generalization and high robustness, especially for deformable objects and implicitly constrained instructions (e.g., "securely grasp a top-heavy cup").
📝 Abstract
Flexible instruction-guided 6-DoF grasping is a significant yet challenging task for real-world robotic systems. Existing methods utilize the contextual understanding capabilities of large language models (LLMs) to establish mappings between expressions and targets, allowing robots to comprehend users' intentions in the instructions. However, the LLM's knowledge about objects' physical properties remains underexplored despite their close relevance to grasping. In this work, we propose GraspCoT, a 6-DoF grasp detection framework that integrates a Chain-of-Thought (CoT) reasoning mechanism oriented to physical properties, guided by auxiliary question-answering (QA) tasks. In particular, we design a set of QA templates to enable hierarchical reasoning across three stages: target parsing, physical property analysis, and grasp action selection. Moreover, GraspCoT presents a unified multimodal LLM architecture, which encodes multi-view observations of 3D scenes into 3D-aware visual tokens and then jointly embeds these visual tokens with CoT-derived textual tokens within the LLM to generate grasp pose predictions. Furthermore, we present IntentGrasp, a large-scale benchmark that fills the gap in public datasets for multi-object grasp detection under diverse and indirect verbal commands. Extensive experiments on IntentGrasp demonstrate the superiority of our method, and additional validation in real-world robotic applications confirms its practicality. Code and data will be released.
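To make the three-stage reasoning concrete, the sequential CoT pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the stage names follow the abstract, but the QA template wording, the `CoTStage` type, and the generic `llm` callable (standing in for the multimodal LLM with 3D-aware visual tokens) are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class CoTStage:
    name: str
    qa_template: str  # auxiliary QA prompt guiding this reasoning stage

# The three sequential stages named in the abstract; template wording is hypothetical.
STAGES: List[CoTStage] = [
    CoTStage("target_parsing",
             "Which object in the scene does the instruction refer to?"),
    CoTStage("physical_analysis",
             "What are the target's physical properties "
             "(mass distribution, rigidity, fragility)?"),
    CoTStage("grasp_selection",
             "Given those properties, which 6-DoF grasp pose is most stable?"),
]

def grasp_cot(instruction: str, llm: Callable[[str], str]) -> List[str]:
    """Run the three reasoning stages in order, feeding each stage's
    answer back into the context for the next stage's prompt."""
    context = f"Instruction: {instruction}"
    answers: List[str] = []
    for stage in STAGES:
        prompt = f"{context}\n[{stage.name}] {stage.qa_template}"
        answer = llm(prompt)          # the real system would also consume visual tokens
        answers.append(answer)
        context += f"\n{stage.name}: {answer}"  # accumulate chain-of-thought
    return answers
```

The key design point mirrored here is that the stages are strictly sequential: grasp selection conditions on the physical analysis, which in turn conditions on the parsed target, rather than answering all three questions independently.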