🤖 AI Summary
Existing robotic grasp planners suffer from limited vision-language understanding and computationally expensive 3D radiance field modeling, hindering real-time, open-vocabulary inference of graspable regions on arbitrary objects. To address this, we propose GLOVER, a Generalizable Open-Vocabulary Affordance Reasoning framework that fine-tunes a multimodal Large Language Model end-to-end in RGB feature space, leveraging the LLM's world knowledge and visual functional reasoning to predict affordances of graspable object parts. For real-world deployment, GLOVER is paired with Affordance-Aware Grasping Estimation (AGE), a non-parametric grasp planner that aligns the gripper pose with a superquadric surface derived from the affordance data. Fine-tuned on a newly curated dataset of 10,000+ human-object interaction images and evaluated on 30 real-world table-top scenes, GLOVER achieves an 86.0% part-identification success rate and a 76.3% grasp success rate, with affordance reasoning roughly 29× faster and grasp pose estimation roughly 40× faster than the previous state-of-the-art. Cross-embodiment generalization is validated on a humanoid robot with dexterous hands.
📝 Abstract
Inferring affordable (i.e., graspable) parts of arbitrary objects based on human specifications is essential for robots to advance toward open-vocabulary manipulation. Current grasp planners, however, are hindered by limited vision-language comprehension and time-consuming 3D radiance modeling, restricting real-time, open-vocabulary interactions with objects. To address these limitations, we propose GLOVER, a unified Generalizable Open-Vocabulary Affordance Reasoning framework that fine-tunes Large Language Models (LLMs) to predict the visual affordance of graspable object parts within the RGB feature space. We compile a dataset of over 10,000 images of human-object interactions, annotated with unified visual and linguistic affordance labels, to enable multi-modal fine-tuning. GLOVER inherits world knowledge and common-sense reasoning from LLMs, enabling finer-grained object understanding and more sophisticated tool-use reasoning. For effective real-world deployment, we present Affordance-Aware Grasping Estimation (AGE), a non-parametric grasp planner that aligns the gripper pose with a superquadric surface derived from the affordance data. In evaluations across 30 real-world table-top scenes, GLOVER achieves success rates of 86.0% in part identification and 76.3% in grasping, with affordance reasoning approximately 29 times faster and grasp pose estimation approximately 40 times faster than the previous state-of-the-art. We also validate generalization across embodiments, demonstrating effectiveness on humanoid robots with dexterous hands.
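The abstract does not spell out AGE's superquadric parameterization, but superquadrics are conventionally described by a scale vector, two shape exponents, and an inside-outside function; a grasp planner can sample the parametric surface for candidate contact points and normals. The sketch below illustrates only that standard formulation (all function names are hypothetical, not from the paper):

```python
import numpy as np

def superquadric_implicit(p, scale, eps):
    """Standard superquadric inside-outside function, origin-centered.

    F < 1: point inside; F == 1: on the surface; F > 1: outside.
    scale = (a1, a2, a3) semi-axis lengths; eps = (e1, e2) shape exponents
    (e1 shapes the profile along z, e2 shapes the cross-section in xy).
    """
    x, y, z = p
    a1, a2, a3 = scale
    e1, e2 = eps
    xy = (abs(x / a1) ** (2.0 / e2) + abs(y / a2) ** (2.0 / e2)) ** (e2 / e1)
    return xy + abs(z / a3) ** (2.0 / e1)

def surface_point(eta, omega, scale, eps):
    """Parametric surface point for angles eta in [-pi/2, pi/2],
    omega in [-pi, pi]; a planner could sample such points (and their
    normals) when aligning a gripper pose with the fitted surface."""
    a1, a2, a3 = scale
    e1, e2 = eps
    # Signed power keeps the surface symmetric across octants.
    f = lambda w, e: np.sign(w) * abs(w) ** e
    ce, se = np.cos(eta), np.sin(eta)
    cw, sw = np.cos(omega), np.sin(omega)
    return np.array([a1 * f(ce, e1) * f(cw, e2),
                     a2 * f(ce, e1) * f(sw, e2),
                     a3 * f(se, e1)])
```

By construction, every point returned by `surface_point` evaluates to exactly 1 under `superquadric_implicit`, which gives a cheap consistency check when fitting the surface to an affordance region.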