🤖 AI Summary
Existing robotic grasp planners suffer from limited vision-language understanding and computationally expensive 3D radiance field modeling, hindering real-time, open-vocabulary inference of graspable regions on arbitrary objects. To address this, we propose GLOVER, a Generalizable Open-Vocabulary Affordance Reasoning framework that fine-tunes a multimodal Large Language Model end-to-end in RGB feature space, leveraging the LLM's world knowledge and visual functional reasoning to predict affordances of graspable object parts. For real-world deployment, GLOVER is paired with Affordance-Aware Grasping Estimation (AGE), a non-parametric grasp planner that aligns the gripper pose with a superquadric surface derived from the affordance data. Fine-tuned on a newly curated dataset of 10,000+ human-object interaction images and evaluated on 30 real-world table-top scenes, GLOVER achieves an 86.0% part-identification success rate and a 76.3% grasp success rate, with affordance reasoning roughly 29× faster and grasp pose estimation roughly 40× faster than the previous state-of-the-art. Cross-embodiment generalization is validated on a humanoid robot with dexterous hands.
📝 Abstract
Inferring affordable (i.e., graspable) parts of arbitrary objects based on human specifications is essential for robots to advance toward open-vocabulary manipulation. Current grasp planners, however, are hindered by limited vision-language comprehension and time-consuming 3D radiance modeling, restricting real-time, open-vocabulary interactions with objects. To address these limitations, we propose GLOVER, a unified Generalizable Open-Vocabulary Affordance Reasoning framework that fine-tunes Large Language Models (LLMs) to predict the visual affordance of graspable object parts within the RGB feature space. We compile a dataset of over 10,000 images of human-object interactions, annotated with unified visual and linguistic affordance labels, to enable multi-modal fine-tuning. GLOVER inherits world knowledge and common-sense reasoning from LLMs, enabling finer-grained object understanding and more sophisticated tool-use reasoning. For effective real-world deployment, we present Affordance-Aware Grasping Estimation (AGE), a non-parametric grasp planner that aligns the gripper pose with a superquadric surface derived from the affordance data. In evaluations across 30 real-world table-top scenes, GLOVER achieves success rates of 86.0% in part identification and 76.3% in grasping, with affordance reasoning approximately 29 times faster and grasp pose estimation approximately 40 times faster than the previous state-of-the-art. We also validate generalization across embodiments, demonstrating effectiveness on humanoid robots with dexterous hands.
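The abstract does not spell out AGE's superquadric parameterization, but superquadrics are conventionally described by a scale vector, two shape exponents, and an inside-outside function; a grasp planner can sample the parametric surface for candidate contact points and normals. The sketch below illustrates only that standard formulation (all function names are hypothetical, not from the paper):

```python
import numpy as np

def superquadric_implicit(p, scale, eps):
    """Standard superquadric inside-outside function, origin-centered.

    F < 1: point inside; F == 1: on the surface; F > 1: outside.
    scale = (a1, a2, a3) semi-axis lengths; eps = (e1, e2) shape exponents
    (e1 shapes the profile along z, e2 shapes the cross-section in xy).
    """
    x, y, z = p
    a1, a2, a3 = scale
    e1, e2 = eps
    xy = (abs(x / a1) ** (2.0 / e2) + abs(y / a2) ** (2.0 / e2)) ** (e2 / e1)
    return xy + abs(z / a3) ** (2.0 / e1)

def surface_point(eta, omega, scale, eps):
    """Parametric surface point for angles eta in [-pi/2, pi/2],
    omega in [-pi, pi]; a planner could sample such points (and their
    normals) when aligning a gripper pose with the fitted surface."""
    a1, a2, a3 = scale
    e1, e2 = eps
    # Signed power keeps the surface symmetric across octants.
    f = lambda w, e: np.sign(w) * abs(w) ** e
    ce, se = np.cos(eta), np.sin(eta)
    cw, sw = np.cos(omega), np.sin(omega)
    return np.array([a1 * f(ce, e1) * f(cw, e2),
                     a2 * f(ce, e1) * f(sw, e2),
                     a3 * f(se, e1)])
```

By construction, every point returned by `surface_point` evaluates to exactly 1 under `superquadric_implicit`, which gives a cheap consistency check when fitting the surface to an affordance region.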