Learning Precise Affordances from Egocentric Videos for Robotic Manipulation

📅 2024-08-19
🏛️ arXiv.org
📈 Citations: 12
Influential: 0
📄 PDF
🤖 AI Summary
Affordance research faces three key challenges: scarcity of large-scale, pixel-level fine-grained annotations; poor cross-object and cross-scene generalization; and lack of end-to-end deployable frameworks. To address these, we introduce the first large-scale, first-person video-based affordance dataset—automatically constructed and annotated for both graspable and functional regions—enabling unified modeling of graspable and functional affordances. We propose a Geometry-Guided Affordance Transformer (GKT) and a Depth Feature Injector (DFI) module to explicitly incorporate 3D geometric priors. Furthermore, we develop Aff-Grasp, an end-to-end framework integrating affordance perception and robotic grasping. GKT achieves a 15.9% mIoU improvement over state-of-the-art methods. In 179 real-robot trials, Aff-Grasp attains 95.5% affordance prediction accuracy and 77.1% grasping success rate, demonstrating significantly enhanced generalization across unseen objects and cluttered scenes.

Technology Category

Application Category

📝 Abstract
Affordance, defined as the potential actions that an object offers, is crucial for robotic manipulation tasks. A deep understanding of affordance can lead to more intelligent AI systems. For example, such knowledge directs an agent to grasp a knife by the handle for cutting and by the blade when passing it to someone. In this paper, we present a streamlined affordance learning system that encompasses data collection, effective model training, and robot deployment. First, we collect training data from egocentric videos in an automatic manner. Different from previous methods that focus only on the object graspable affordance and represent it as coarse heatmaps, we cover both graspable (e.g., object handles) and functional affordances (e.g., knife blades, hammer heads) and extract data with precise segmentation masks. We then propose an effective model, termed Geometry-guided Affordance Transformer (GKT), to train on the collected data. GKT integrates an innovative Depth Feature Injector (DFI) to incorporate 3D shape and geometric priors, enhancing the model's understanding of affordances. To enable affordance-oriented manipulation, we further introduce Aff-Grasp, a framework that combines GKT with a grasp generation model. For comprehensive evaluation, we create an affordance evaluation dataset with pixel-wise annotations, and design real-world tasks for robot experiments. The results show that GKT surpasses the state-of-the-art by 15.9% in mIoU, and Aff-Grasp achieves high success rates of 95.5% in affordance prediction and 77.1% in successful grasping among 179 trials, including evaluations with seen, unseen objects, and cluttered scenes.
Problem

Research questions and friction points this paper is trying to address.

Addressing data scarcity in affordance learning with automated annotation
Improving generalization across domains and novel object classes
Enabling real-world deployment for robotic manipulation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates precise affordance annotations autonomously from videos
Uses geometric data and foundation models for better generalization
Enables affordance-based robotic grasping and tool handover
G
Gen Li
School of Informatics, University of Edinburgh, Edinburgh, UK
N
Nikolaos Tsagkas
School of Informatics, University of Edinburgh, Edinburgh, UK
Jifei Song
Jifei Song
Huawei Noah’s Ark Lab
Neural RenderingComputer VisionDeep LearningImage ProcessingSpeech Processing
R
Ruaridh Mon-Williams
School of Informatics, University of Edinburgh, Edinburgh, UK
S
S. Vijayakumar
School of Informatics, University of Edinburgh, Edinburgh, UK
Kun Shao
Kun Shao
Huawei
AI Agentreinforcement learningmulti-agent systemsembodied AIgame AI
Laura Sevilla-Lara
Laura Sevilla-Lara
Reader at University of Edinburgh
Computer Vision