Attribute-based Object Grounding and Robot Grasp Detection with Spatial Reasoning

📅 2025-09-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address challenges including open-form language expressions, ambiguity from duplicated object instances, and high annotation costs, this paper proposes a language-driven robotic grasping method. The approach introduces a bidirectional vision-language fusion module that integrates depth feature encoding with spatial attention to enhance geometry-aware target localization and planar grasp pose estimation. The framework supports both pixel-level fully supervised and single-point weakly supervised training paradigms. In tabletop-scene experiments, the fully supervised variant (RGS) runs at 17.59 FPS and significantly improves localization and grasp accuracy over strong baselines. The weakly supervised variant (RGA) attains higher grasp success rates in both simulation and real-robot deployments, substantially reducing reliance on dense pixel-level annotations while maintaining robust performance.
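The bidirectional fusion described above can be pictured as cross-attention running in both directions: language tokens gather evidence from the visual feature map, and visual locations are then re-weighted by the updated expression. The PyTorch module below is a minimal illustrative sketch of that idea; the class name, dimensions, and the use of `nn.MultiheadAttention` are assumptions for exposition, not the paper's actual implementation.

```python
# Hypothetical sketch of a bi-directional vision-language fusion block.
# All names and dimensions are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class BiDirectionalFusion(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # vision -> language: word tokens attend over image regions
        self.lang_from_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # language -> vision: image regions attend over the expression
        self.vis_from_lang = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_lang = nn.LayerNorm(dim)
        self.norm_vis = nn.LayerNorm(dim)

    def forward(self, vis_feat: torch.Tensor, lang_feat: torch.Tensor):
        """
        vis_feat:  (B, H*W, C) flattened visual feature map (e.g. RGB-D backbone output)
        lang_feat: (B, T, C)   token embeddings of the referring expression
        """
        # Language tokens query the visual map.
        lang_upd, _ = self.lang_from_vis(lang_feat, vis_feat, vis_feat)
        lang_feat = self.norm_lang(lang_feat + lang_upd)
        # Visual locations query the updated language tokens.
        vis_upd, _ = self.vis_from_lang(vis_feat, lang_feat, lang_feat)
        vis_feat = self.norm_vis(vis_feat + vis_upd)
        return vis_feat, lang_feat


if __name__ == "__main__":
    fusion = BiDirectionalFusion(dim=256)
    vis = torch.randn(2, 32 * 32, 256)   # a 32x32 feature map, flattened
    lang = torch.randn(2, 12, 256)       # a 12-token expression
    vis_out, lang_out = fusion(vis, lang)
    print(vis_out.shape, lang_out.shape)
```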

📝 Abstract
Enabling robots to grasp objects specified through natural language is essential for effective human-robot interaction, yet it remains a significant challenge. Existing approaches often struggle with open-form language expressions and typically assume unambiguous target objects without duplicates. Moreover, they frequently rely on costly, dense pixel-wise annotations for both object grounding and grasp configuration. We present Attribute-based Object Grounding and Robotic Grasping (OGRG), a novel framework that interprets open-form language expressions and performs spatial reasoning to ground target objects and predict planar grasp poses, even in scenes containing duplicated object instances. We investigate OGRG in two settings: (1) Referring Grasp Synthesis (RGS) under pixel-wise full supervision, and (2) Referring Grasp Affordance (RGA) using weakly supervised learning with only single-pixel grasp annotations. Key contributions include a bi-directional vision-language fusion module and the integration of depth information to enhance geometric reasoning, improving both grounding and grasping performance. Experimental results show that OGRG outperforms strong baselines in tabletop scenes with diverse spatial language instructions. In RGS, it operates at 17.59 FPS on a single NVIDIA RTX 2080 Ti GPU, enabling potential use in closed-loop or multi-object sequential grasping, while delivering superior grounding and grasp prediction accuracy compared to all the baselines considered. Under the weakly supervised RGA setting, OGRG also surpasses baseline grasp-success rates in both simulation and real-robot trials, underscoring the effectiveness of its spatial reasoning design. Project page: https://z.umn.edu/ogrg
Problem

Research questions and friction points this paper is trying to address.

Grounding target objects from open-form language expressions, even when duplicate instances are present
Predicting planar grasp poses using spatial reasoning
Reducing reliance on costly dense pixel-wise annotations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bi-directional vision-language fusion module
Integration of depth for geometric reasoning
Weakly supervised learning with single-pixel grasp annotations (see the sketch below)
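The single-pixel weak supervision can be sketched as a loss that treats grasp localization as classification over image pixels, supervised only at the one annotated grasp point per target. The function below is a minimal sketch under that assumption; `single_pixel_grasp_loss`, its signature, and the cross-entropy formulation are hypothetical and not taken from the paper.

```python
# Hypothetical single-pixel weakly supervised grasp objective: only one
# annotated grasp pixel per sample supervises the predicted quality map.
# Names and shapes are illustrative assumptions, not the paper's exact loss.
import torch
import torch.nn.functional as F


def single_pixel_grasp_loss(quality_map: torch.Tensor,
                            grasp_points: torch.Tensor) -> torch.Tensor:
    """
    quality_map:  (B, 1, H, W) predicted grasp-quality logits over the image
    grasp_points: (B, 2) annotated (row, col) of a single feasible grasp pixel
    """
    b, _, h, w = quality_map.shape
    logits = quality_map.view(b, h * w)
    # Flatten the annotated pixel into an index over the H*W grid.
    target = grasp_points[:, 0] * w + grasp_points[:, 1]
    # Pixel-wise classification: the annotated pixel should outrank all others.
    return F.cross_entropy(logits, target)


if __name__ == "__main__":
    pred = torch.randn(4, 1, 56, 56, requires_grad=True)
    points = torch.tensor([[10, 20], [30, 5], [40, 40], [3, 50]])
    loss = single_pixel_grasp_loss(pred, points)
    loss.backward()
    print(float(loss))
```

Framing grasp localization as classification over pixels lets a single click per object stand in for a dense mask, which is what makes the weakly supervised RGA setting substantially cheaper to annotate than pixel-wise full supervision.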
👥 Authors
Houjian Yu (Amazon; University of Minnesota): Robotics, Computer Vision
Zheming Zhou (Amazon Lab126, Sunnyvale, CA, USA)
Min Sun (Amazon Lab126, Sunnyvale, CA, USA; National Tsing Hua University, Taiwan)
Omid Ghasemalizadeh (Applied Science Manager): 3D Perception, World Modeling, Robotics, Augmented Reality
Yuyin Sun (Amazon Lab126, Sunnyvale, CA, USA)
Cheng-Hao Kuo (Amazon): Computer Vision
Arnie Sen (Amazon Lab126, Sunnyvale, CA, USA)
Changhyun Choi (Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, USA)