Language-Guided Grasp Detection with Coarse-to-Fine Learning for Robotic Manipulation

📅 2025-12-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address weak semantic alignment and superficial vision-language fusion in language-guided grasping for unstructured scenes, this paper proposes a coarse-to-fine cross-modal learning framework. Methodologically, it introduces (1) a hierarchical cross-modal fusion mechanism that enables deep alignment of CLIP-based multimodal embeddings; (2) a language-conditioned dynamic convolutional head (LDCH) for instruction-adaptive grasp localization; and (3) a grasp consistency refinement module to jointly enhance geometric and semantic coherence. Evaluated on OCID-VLG and Grasp-Anything++, the method achieves significant improvements over state-of-the-art approaches, demonstrating strong generalization across diverse object categories and linguistic formulations. Extensive real-world experiments on a physical robotic platform further validate its high-precision execution capability under complex, natural-language grasp instructions.

📝 Abstract
Grasping is one of the most fundamental yet challenging capabilities in robotic manipulation, especially in unstructured, cluttered, and semantically diverse environments. Recent research has increasingly explored language-guided manipulation, where robots not only perceive the scene but also interpret task-relevant natural language instructions. However, existing language-conditioned grasping methods typically rely on shallow fusion strategies, leading to limited semantic grounding and weak alignment between linguistic intent and visual grasp reasoning.

In this work, we propose Language-Guided Grasp Detection (LGGD) with a coarse-to-fine learning paradigm for robotic manipulation. LGGD leverages CLIP-based visual and textual embeddings within a hierarchical cross-modal fusion pipeline, progressively injecting linguistic cues into the visual feature reconstruction process. This design enables fine-grained visual-semantic alignment and improves the feasibility of the predicted grasps with respect to task instructions. In addition, we introduce a language-conditioned dynamic convolution head (LDCH) that mixes multiple convolution experts based on sentence-level features, enabling instruction-adaptive coarse mask and grasp predictions. A final refinement module further enhances grasp consistency and robustness in complex scenes.

Experiments on the OCID-VLG and Grasp-Anything++ datasets show that LGGD surpasses existing language-guided grasping methods, exhibiting strong generalization to unseen objects and diverse language queries. Moreover, deployment on a real robotic platform demonstrates the practical effectiveness of our approach in executing accurate, instruction-conditioned grasp actions. The code will be released publicly upon acceptance.
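The paper's code is not yet released, but the core idea behind the LDCH — routing a sentence-level embedding to softmax mixture weights over a bank of convolution experts, then applying the mixed kernel — can be sketched in a few lines. This is a minimal NumPy illustration using 1×1 convolution experts and a hypothetical linear router (`router_w`, `experts`, and all shapes are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def ldch_forward(feat, sent_emb, experts, router_w):
    """Language-conditioned dynamic convolution (1x1 case for brevity).

    feat:     (C_in, H, W) visual feature map
    sent_emb: (D,) sentence-level text embedding
    experts:  (K, C_out, C_in) bank of K 1x1-conv expert kernels
    router_w: (K, D) hypothetical linear router mapping the sentence
              embedding to K expert logits
    """
    # Route: sentence embedding -> softmax weights over the K experts.
    alpha = softmax(router_w @ sent_emb)            # (K,)
    # Mix the expert kernels into one instruction-specific kernel.
    kernel = np.tensordot(alpha, experts, axes=1)   # (C_out, C_in)
    # Apply the mixed 1x1 convolution at every spatial location.
    c_in, h, w = feat.shape
    out = kernel @ feat.reshape(c_in, h * w)        # (C_out, H*W)
    return out.reshape(-1, h, w)
```

Because the mixing weights depend on the instruction, two different sentences yield two different effective kernels over the same visual features, which is what makes the coarse mask and grasp predictions "instruction-adaptive."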
Problem

Research questions and friction points this paper is trying to address.

Develops language-guided grasp detection for robots in cluttered environments
Enhances semantic alignment between visual grasp reasoning and linguistic intent
Improves grasp feasibility and robustness for unseen objects and instructions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical cross-modal fusion for visual-semantic alignment
Language-conditioned dynamic convolution head for adaptive predictions
Coarse-to-fine refinement module enhancing grasp consistency
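The hierarchical fusion idea — progressively injecting linguistic cues while decoding from coarse to fine visual features — can be approximated with a FiLM-style modulation at each decoder level. The sketch below is a loose NumPy stand-in for the paper's mechanism (the per-level scale/shift parameterization, nearest-neighbor upsampling, and all shapes are my assumptions, not the authors' design):

```python
import numpy as np

def film_inject(feat, sent_emb, w_gamma, w_beta):
    """FiLM-style linguistic modulation of one decoder stage.

    feat: (C, H, W) visual feature; sent_emb: (D,) text embedding;
    w_gamma, w_beta: (C, D) hypothetical projection matrices.
    """
    gamma = w_gamma @ sent_emb   # per-channel scale from language
    beta = w_beta @ sent_emb     # per-channel shift from language
    return gamma[:, None, None] * feat + beta[:, None, None]

def coarse_to_fine_decode(pyramid, sent_emb, params):
    """Fuse language into every pyramid level, coarse to fine.

    pyramid: list of (C, H, W) maps ordered coarse -> fine (same C);
    params:  one (w_gamma, w_beta) pair per level.
    """
    x = None
    for feat, (wg, wb) in zip(pyramid, params):
        fused = film_inject(feat, sent_emb, wg, wb)
        if x is None:
            x = fused
        else:
            # Nearest-neighbor upsample the coarser result, then merge.
            rh = feat.shape[1] // x.shape[1]
            rw = feat.shape[2] // x.shape[2]
            x = np.repeat(np.repeat(x, rh, axis=1), rw, axis=2)
            x = x + fused
    return x
```

The key property this toy version preserves is that language conditions *every* decoding stage rather than being fused once at the end, which is the contrast the paper draws against shallow-fusion baselines.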