GraspCoT: Integrating Physical Property Reasoning for 6-DoF Grasping under Flexible Language Instructions

📅 2025-03-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing large language models (LLMs) lack explicit modeling of object physical properties, leading to poor robustness in natural-language-driven 6-DoF grasping. Method: We propose a physics-aware chain-of-thought (CoT) framework that decomposes the task into three sequential reasoning stages: target parsing, physical analysis, and grasp selection. To support evaluation, we introduce IntentGrasp—the first benchmark enabling multi-object scenes and indirect, physics-referential instructions. Our method employs a unified multimodal architecture integrating multi-view 3D visual encoders with an LLM, incorporates auxiliary question-answering tasks to explicitly foster physical understanding, and enables joint embedding of 3D-aware visual tokens and textual tokens. Contribution/Results: Our approach significantly outperforms state-of-the-art methods on IntentGrasp. Real-robot experiments demonstrate strong generalization and high robustness—especially for deformable objects and implicitly constrained instructions (e.g., “securely grasp a top-heavy cup”).
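The three sequential reasoning stages can be sketched as a minimal prompt pipeline. This is an illustrative assumption, not the paper's actual QA templates: `llm` stands in for any text-generation call, and `fake_llm` is a deterministic stub so the sketch runs without a model.

```python
# Hedged sketch of the three-stage CoT pipeline summarized above.
# The prompts are illustrative placeholders, not GraspCoT's templates.

def physics_aware_grasp_cot(instruction: str, scene: str, llm) -> dict:
    # Stage 1: target parsing — resolve which object the
    # (possibly indirect) instruction refers to.
    target = llm(
        f"Scene: {scene}\nInstruction: {instruction}\n"
        "Which object is the target?"
    )
    # Stage 2: physical property analysis — reason about mass
    # distribution, rigidity/deformability, and fragility.
    properties = llm(
        f"List the grasp-relevant physical properties of: {target}"
    )
    # Stage 3: grasp action selection — choose a grasp consistent
    # with the inferred properties.
    grasp = llm(
        f"Given properties '{properties}', where and how should "
        f"the robot grasp {target}?"
    )
    return {"target": target, "properties": properties, "grasp": grasp}

def fake_llm(prompt: str) -> str:
    # Deterministic stub: echoes the last line of the prompt.
    return prompt.splitlines()[-1]
```

Swapping `fake_llm` for a real model call would turn this skeleton into a working (if simplified) chain-of-thought loop.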

📝 Abstract
Flexible instruction-guided 6-DoF grasping is a significant yet challenging task for real-world robotic systems. Existing methods utilize the contextual understanding capabilities of large language models (LLMs) to establish mappings between expressions and targets, allowing robots to comprehend users' intentions in the instructions. However, the LLM's knowledge about objects' physical properties remains underexplored despite its close relevance to grasping. In this work, we propose GraspCoT, a 6-DoF grasp detection framework that integrates a Chain-of-Thought (CoT) reasoning mechanism oriented toward physical properties, guided by auxiliary question-answering (QA) tasks. In particular, we design a set of QA templates to enable hierarchical reasoning in three stages: target parsing, physical property analysis, and grasp action selection. Moreover, GraspCoT presents a unified multimodal LLM architecture, which encodes multi-view observations of 3D scenes into 3D-aware visual tokens, and then jointly embeds these visual tokens with CoT-derived textual tokens within LLMs to generate grasp pose predictions. Furthermore, we present IntentGrasp, a large-scale benchmark that fills the gap in public datasets for multi-object grasp detection under diverse and indirect verbal commands. Extensive experiments on IntentGrasp demonstrate the superiority of our method, with additional validation in real-world robotic applications confirming its practicality. Codes and data will be released.
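The joint embedding step in the abstract can be illustrated with a toy fusion of visual and text tokens. This is a minimal sketch under assumed shapes, not the paper's architecture: `proj` stands in for the learned projection that maps visual features into the LLM's embedding space.

```python
import numpy as np

# Illustrative sketch of fusing multi-view "3D-aware" visual tokens
# with CoT-derived text tokens into one sequence for an LLM.
# All shapes and the linear projection are assumptions.

def fuse_tokens(visual_feats: np.ndarray, text_embeds: np.ndarray,
                proj: np.ndarray) -> np.ndarray:
    """visual_feats: (num_visual_tokens, d_vis)
       text_embeds:  (seq_len, d_model)
       proj:         (d_vis, d_model) map into the LLM token space."""
    visual_tokens = visual_feats @ proj           # project to LLM width
    # Prepend visual tokens to the text sequence, as is common
    # in multimodal LLM front-ends.
    return np.concatenate([visual_tokens, text_embeds], axis=0)

rng = np.random.default_rng(0)
fused = fuse_tokens(rng.normal(size=(48, 256)),   # 48 visual tokens
                    rng.normal(size=(32, 512)),   # 32 text tokens
                    rng.normal(size=(256, 512)))
# fused has shape (80, 512): visual tokens first, then text tokens.
```

In practice the projected sequence would be fed to the LLM backbone, which then decodes grasp pose predictions.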
Problem

Research questions and friction points this paper is trying to address.

LLM-driven grasping methods under-exploit objects' physical properties, limiting robustness.
6-DoF grasp detection lacks structured, physics-aware reasoning over language instructions.
No public benchmark covers multi-object grasp detection under indirect, flexible commands.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates Chain-of-Thought reasoning for physical properties
Uses QA templates for hierarchical reasoning stages
Unified multimodal LLM architecture for grasp predictions