Learning Precise Affordances from Egocentric Videos for Robotic Manipulation

📅 2024-08-19

🏛️ arXiv.org

📈 Citations: 12

✨ Influential: 0

career value

203K/year

🤖 AI Summary

Affordance research faces three key challenges: scarcity of large-scale, pixel-level fine-grained annotations; poor cross-object and cross-scene generalization; and lack of end-to-end deployable frameworks. To address these, we introduce the first large-scale, first-person video-based affordance dataset—automatically constructed and annotated for both graspable and functional regions—enabling unified modeling of graspable and functional affordances. We propose a Geometry-Guided Affordance Transformer (GKT) and a Depth Feature Injector (DFI) module to explicitly incorporate 3D geometric priors. Furthermore, we develop Aff-Grasp, an end-to-end framework integrating affordance perception and robotic grasping. GKT achieves a 15.9% mIoU improvement over state-of-the-art methods. In 179 real-robot trials, Aff-Grasp attains 95.5% affordance prediction accuracy and 77.1% grasping success rate, demonstrating significantly enhanced generalization across unseen objects and cluttered scenes.

Technology Category

Application Category

📝 Abstract

Affordance, defined as the potential actions that an object offers, is crucial for robotic manipulation tasks. A deep understanding of affordance can lead to more intelligent AI systems. For example, such knowledge directs an agent to grasp a knife by the handle for cutting and by the blade when passing it to someone. In this paper, we present a streamlined affordance learning system that encompasses data collection, effective model training, and robot deployment. First, we collect training data from egocentric videos in an automatic manner. Different from previous methods that focus only on the object graspable affordance and represent it as coarse heatmaps, we cover both graspable (e.g., object handles) and functional affordances (e.g., knife blades, hammer heads) and extract data with precise segmentation masks. We then propose an effective model, termed Geometry-guided Affordance Transformer (GKT), to train on the collected data. GKT integrates an innovative Depth Feature Injector (DFI) to incorporate 3D shape and geometric priors, enhancing the model's understanding of affordances. To enable affordance-oriented manipulation, we further introduce Aff-Grasp, a framework that combines GKT with a grasp generation model. For comprehensive evaluation, we create an affordance evaluation dataset with pixel-wise annotations, and design real-world tasks for robot experiments. The results show that GKT surpasses the state-of-the-art by 15.9% in mIoU, and Aff-Grasp achieves high success rates of 95.5% in affordance prediction and 77.1% in successful grasping among 179 trials, including evaluations with seen, unseen objects, and cluttered scenes.

Problem

Research questions and friction points this paper is trying to address.

Addressing data scarcity in affordance learning with automated annotation

Improving generalization across domains and novel object classes

Enabling real-world deployment for robotic manipulation tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates precise affordance annotations autonomously from videos

Uses geometric data and foundation models for better generalization

Enables affordance-based robotic grasping and tool handover

🔎 Similar Papers

Learning by Watching: A Review of Video-based Learning Approaches for Robot Manipulation