VCoT-Grasp: Grasp Foundation Models with Visual Chain-of-Thought Reasoning for Language-driven Grasp Generation

📅 2025-10-07
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Existing language-guided grasping methods suffer from poor generalization, reliance on complex modular pipelines, and, in mainstream grasping foundation models, an overemphasis on dialog understanding and single-object semantics, which limits their effectiveness in multi-object and dynamic scenes. To address these limitations, we propose VCoT-Grasp, the first end-to-end visual chain-of-thought grasping foundation model, which maps natural language instructions directly to grasp poses via iterative visual attention and interpretable, stepwise reasoning. Our approach integrates intermediate bounding-box supervision, joint training on synthetic and real-world data, and a dynamic attention mechanism. Evaluated on the newly introduced VCoT-GraspSet benchmark and in real-robot experiments, VCoT-Grasp achieves significant improvements in grasp success rate and generalizes robustly to unseen objects, cluttered backgrounds, and distractor-rich environments.
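To make the multi-turn paradigm concrete, the sketch below shows one plausible reading of the pipeline: the model first localizes the referred object with a bounding box, the image is cropped around that box, and the grasp is then predicted from the focused view. All class and function names (VCoTGraspModel, predict_box, predict_grasp, crop) are hypothetical placeholders, and the rectangle-style grasp parameterization is an assumption; the paper's actual interfaces may differ.

```python
# Hypothetical sketch of a visual chain-of-thought grasp pipeline.
# Names and the grasp parameterization are placeholders, not the paper's API.
from dataclasses import dataclass

@dataclass
class Box:
    x1: float
    y1: float
    x2: float
    y2: float

@dataclass
class Grasp:
    cx: float     # grasp center x (pixels)
    cy: float     # grasp center y (pixels)
    width: float  # gripper opening width (pixels)
    height: float # gripper jaw size (pixels)
    angle: float  # in-plane rotation (radians)

class VCoTGraspModel:
    """Stand-in for an end-to-end vision-language grasp model."""

    def predict_box(self, image, instruction: str) -> Box:
        # Turn 1: reason about the instruction and localize the target object.
        raise NotImplementedError

    def predict_grasp(self, image_crop, instruction: str) -> Grasp:
        # Turn 2: predict a grasp from the zoomed-in view of the target.
        raise NotImplementedError

def crop(image, box: Box, margin: float = 0.1):
    """Crop the image around the box with a small margin (assumed behavior)."""
    h, w = image.shape[:2]
    dx, dy = margin * (box.x2 - box.x1), margin * (box.y2 - box.y1)
    x1, y1 = max(0, int(box.x1 - dx)), max(0, int(box.y1 - dy))
    x2, y2 = min(w, int(box.x2 + dx)), min(h, int(box.y2 + dy))
    return image[y1:y2, x1:x2]

def grasp_with_visual_cot(model: VCoTGraspModel, image, instruction: str) -> Grasp:
    box = model.predict_box(image, instruction)      # interpretable intermediate step
    focused = crop(image, box)                       # dynamically focus on the target
    return model.predict_grasp(focused, instruction) # final grasp from the focused view
```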

📝 Abstract
Robotic grasping is one of the most fundamental tasks in robotic manipulation, and grasp detection/generation has long been the subject of extensive research. Recently, language-driven grasp generation has emerged as a promising direction due to its practical interaction capabilities. However, most existing approaches either lack sufficient reasoning and generalization capabilities or depend on complex modular pipelines. Moreover, current grasp foundation models tend to overemphasize dialog and object semantics, resulting in inferior performance and restriction to single-object grasping. To maintain strong reasoning ability and generalization in cluttered environments, we propose VCoT-Grasp, an end-to-end grasp foundation model that incorporates visual chain-of-thought reasoning to enhance visual understanding for grasp generation. VCoT-Grasp adopts a multi-turn processing paradigm that dynamically focuses on visual inputs while providing interpretable reasoning traces. For training, we refine and introduce a large-scale dataset, VCoT-GraspSet, comprising 167K synthetic images with over 1.36M grasps, as well as 400+ real-world images with more than 1.2K grasps, annotated with intermediate bounding boxes. Extensive experiments on both VCoT-GraspSet and a real robot demonstrate that our method significantly improves grasp success rates and generalizes effectively to unseen objects, backgrounds, and distractors. More details can be found at https://zhanghr2001.github.io/VCoT-Grasp.github.io.
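The abstract implies that each VCoT-GraspSet sample pairs an image and a language instruction with an intermediate bounding box and one or more grasp annotations. A minimal record schema consistent with that description might look like the following; the field names, the (x1, y1, x2, y2) box format, and the rectangle-grasp parameterization are assumptions rather than the released format.

```python
# Hypothetical record schema for a VCoT-GraspSet-style sample.
# Field names and grasp parameterization are assumptions based on the abstract.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class GraspAnnotation:
    center: Tuple[float, float]  # (x, y) in pixels
    size: Tuple[float, float]    # (width, height) of the grasp rectangle
    angle: float                 # in-plane rotation (radians)

@dataclass
class VCoTGraspSample:
    image_path: str                                 # synthetic or real-world image
    instruction: str                                # natural language grasp instruction
    target_box: Tuple[float, float, float, float]   # intermediate bounding box (x1, y1, x2, y2)
    grasps: List[GraspAnnotation] = field(default_factory=list)
    is_real: bool = False                           # 167K synthetic vs. 400+ real-world images

# Illustrative instance (values are made up).
sample = VCoTGraspSample(
    image_path="images/000123.png",
    instruction="grasp the mug behind the bowl",
    target_box=(210.0, 95.0, 340.0, 230.0),
    grasps=[GraspAnnotation(center=(275.0, 160.0), size=(80.0, 30.0), angle=0.35)],
)
```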
Problem

Research questions and friction points this paper is trying to address.

Enhancing robotic grasp generation with visual reasoning
Overcoming limitations in cluttered environments and generalization
Providing interpretable reasoning traces for grasp decisions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses visual chain-of-thought reasoning for grasp generation
Adopts multi-turn processing with interpretable reasoning traces
Trains on a large-scale dataset annotated with intermediate bounding boxes (see the training-objective sketch after this list)
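The exact training objective is not spelled out in this summary. One common way to realize intermediate bounding-box supervision in an autoregressive vision-language model is to serialize the box and the grasp as token sequences and apply a weighted next-token loss to each turn, as in the sketch below; the loss weights and the tokenized output format are assumptions, not the paper's specification.

```python
# Hypothetical joint objective: intermediate bounding-box supervision plus
# grasp prediction. Weighting and tokenization are assumptions.
import torch
import torch.nn.functional as F

def vcot_loss(box_logits, box_targets, grasp_logits, grasp_targets,
              box_weight: float = 1.0, grasp_weight: float = 1.0) -> torch.Tensor:
    """Cross-entropy over serialized box tokens (turn 1) and grasp tokens (turn 2).

    box_logits:    (N_box_tokens, vocab_size) logits for the intermediate box string
    box_targets:   (N_box_tokens,) token ids of the ground-truth box
    grasp_logits:  (N_grasp_tokens, vocab_size) logits for the grasp string
    grasp_targets: (N_grasp_tokens,) token ids of the ground-truth grasp
    """
    box_loss = F.cross_entropy(box_logits, box_targets)
    grasp_loss = F.cross_entropy(grasp_logits, grasp_targets)
    return box_weight * box_loss + grasp_weight * grasp_loss
```

In practice, both terms could also be computed in a single pass over the full multi-turn sequence with appropriate masking; that design choice is not specified here.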