🤖 AI Summary
To address robots' weak functional perception of objects and the scarcity of large-scale, reasoning-oriented affordance annotations for open-world grasping, this paper introduces RAGNet, the first large-scale affordance segmentation benchmark for general-purpose grasping. It comprises 273K images spanning diverse embodied domains (wild scenes, robot viewpoints, ego-centric views, and simulation), 180 object categories, and 26K reasoning instructions that omit category names and describe targets only by their function. Building on this benchmark, we propose AffordanceNet: a unified framework that pairs a VLM pre-trained on the massive affordance data with a grasping network conditioned on the predicted affordance map, linking functional understanding to action generation. Extensive experiments show that the approach significantly improves generalization across affordance segmentation benchmarks and transfers effectively to real robotic platforms.
📝 Abstract
General robotic grasping systems require accurate object affordance perception in diverse open-world scenarios following human instructions. However, current studies lack large-scale, reasoning-based affordance prediction data, raising serious concerns about their open-world effectiveness. To address this limitation, we build a large-scale grasping-oriented affordance segmentation benchmark with human-like instructions, named RAGNet. It contains 273k images, 180 categories, and 26k reasoning instructions. The images cover diverse embodied data domains, such as wild, robot, ego-centric, and even simulation data. Each image is carefully annotated with an affordance map, and the difficulty of the language instructions is greatly increased by removing category names and providing only functional descriptions. Furthermore, we propose a comprehensive affordance-based grasping framework, named AffordanceNet, which consists of a VLM pre-trained on our massive affordance data and a grasping network that conditions on the affordance map to grasp the target. Extensive experiments on affordance segmentation benchmarks and real-robot manipulation tasks show that our model has powerful open-world generalization ability. Our data and code are available at https://github.com/wudongming97/AffordanceNet.
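The abstract describes a two-stage pipeline: a VLM first predicts a per-pixel affordance map from an image and a functional instruction, and a grasping network then conditions on that map to select a grasp. The minimal sketch below illustrates only this data flow; the function names, dummy heuristics, and array shapes are hypothetical stand-ins, not the paper's actual models or API.

```python
import numpy as np

def predict_affordance(image: np.ndarray, instruction: str) -> np.ndarray:
    """Stand-in for the VLM stage: returns a per-pixel affordance map in [0, 1].

    Dummy heuristic (for illustration only): score pixels by closeness to the
    image center, as if the "graspable" region were centered in view.
    """
    h, w, _ = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    dist = np.sqrt((ys - h / 2) ** 2 + (xs - w / 2) ** 2)
    return np.clip(1.0 - dist / dist.max(), 0.0, 1.0)

def predict_grasp(depth: np.ndarray, affordance: np.ndarray) -> tuple:
    """Stand-in for the grasping network: conditions on the affordance map.

    Here this is reduced to picking the highest-affordance pixel and reading
    its depth, yielding a grasp point (u, v, z) in image coordinates.
    """
    y, x = np.unravel_index(np.argmax(affordance), affordance.shape)
    return (int(x), int(y), float(depth[y, x]))

# Toy inputs standing in for an RGB-D observation and a functional instruction.
image = np.zeros((64, 64, 3))
depth = np.ones((64, 64))
aff = predict_affordance(image, "something to drink water with")
grasp = predict_grasp(depth, aff)
```

In the real system, `predict_affordance` would be the pre-trained VLM producing an open-world segmentation from the category-free instruction, and `predict_grasp` would regress a full grasp pose rather than a single pixel; the sketch only shows how the affordance map mediates between the two stages.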