🤖 AI Summary
This work addresses the limitations of existing language-guided grasping methods in cluttered or occluded scenes, where multi-stage pipelines often suffer from weak cross-modal fusion, computational redundancy, and limited generalization. To overcome these challenges, the authors propose GeoLanG, an end-to-end multi-task framework that unifies visual and linguistic representations through CLIP and explicitly models geometric priors via a Depth-Guided Geometric Module (DGGM). An Adaptive Dense Channel Integration mechanism further balances the contributions of multi-layer features, yielding more discriminative visual representations and precise cross-modal alignment. Extensive experiments on the OCID-VLG dataset and on both simulated and real-world robotic platforms show that GeoLanG outperforms current state-of-the-art approaches, achieving high accuracy and strong robustness in diverse, challenging environments.
📝 Abstract
Language-guided grasping has emerged as a promising paradigm for enabling robots to identify and manipulate target objects through natural language instructions, yet it remains highly challenging in cluttered or occluded scenes. Existing methods often rely on multi-stage pipelines that separate object perception from grasping, which leads to limited cross-modal fusion, redundant computation, and poor generalization in cluttered, occluded, or low-texture scenes. To address these limitations, we propose GeoLanG, an end-to-end multi-task framework built upon the CLIP architecture that unifies visual and linguistic inputs in a shared representation space for robust semantic alignment and improved generalization. To enhance target discrimination under occlusion and low-texture conditions, we introduce the Depth-Guided Geometric Module (DGGM), which makes more effective use of depth information by converting it into explicit geometric priors and injecting them into the attention mechanism without additional computational overhead. In addition, we propose Adaptive Dense Channel Integration, which adaptively balances the contributions of multi-layer features to produce more discriminative and generalizable visual representations. Extensive experiments on the OCID-VLG dataset, as well as in simulation and on real-world hardware, demonstrate that GeoLanG enables precise and robust language-guided grasping in complex, cluttered environments, paving the way toward more reliable multimodal robotic manipulation in real-world, human-centric settings.
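For intuition, below is a minimal PyTorch sketch of what the two mechanisms described in the abstract could look like. The abstract only names them, so the module names (`DepthGuidedAttention`, `AdaptiveDenseFusion`), the tensor shapes, and the specific choices of an additive depth bias on attention logits and per-channel softmax layer weights are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DepthGuidedAttention(nn.Module):
    """Hypothetical reading of DGGM: a geometric prior derived from depth
    is added to the attention logits as a bias, so the attention pattern
    itself carries no extra passes (the paper's formulation may differ)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Maps a scalar pairwise depth difference to one bias per head.
        self.depth_bias = nn.Linear(1, num_heads)

    def forward(self, x: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) patch tokens; depth: (B, N) per-token depth values.
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (B, H, N, d)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, H, N, N)

        # Geometric prior: pairwise depth differences -> per-head bias.
        diff = (depth.unsqueeze(2) - depth.unsqueeze(1)).unsqueeze(-1)  # (B, N, N, 1)
        bias = self.depth_bias(diff).permute(0, 3, 1, 2)                # (B, H, N, N)

        attn = F.softmax(attn + bias, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


class AdaptiveDenseFusion(nn.Module):
    """Illustrative take on Adaptive Dense Channel Integration: learned
    softmax weights balance each encoder layer's contribution per channel."""

    def __init__(self, num_layers: int, dim: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_layers, dim))

    def forward(self, feats: list[torch.Tensor]) -> torch.Tensor:
        # feats: list of (B, N, C) features from different encoder layers.
        w = F.softmax(self.logits, dim=0)    # (L, C), sums to 1 per channel
        stacked = torch.stack(feats, dim=0)  # (L, B, N, C)
        return (w[:, None, None, :] * stacked).sum(dim=0)
```

An additive bias reuses the attention score matrix that is computed anyway, which is one way to reconcile the abstract's claim of injecting geometric priors "without additional computational overhead"; likewise, because the fusion weights sum to one across layers, each channel adaptively chooses which encoder depth to trust rather than averaging all layers uniformly.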