Afford-X: Generalizable and Slim Affordance Reasoning for Task-oriented Manipulation

๐Ÿ“… 2025-03-05
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
To address the poor generalization and deployment challenges of object affordance reasoning in task-oriented robotic manipulation, this paper proposes an end-to-end vision-action semantic mapping framework. Methodologically, the authors introduce LVIS-Aff, a large-scale multi-task affordance dataset, and design Afford-X, a lightweight model featuring novel Verb Attention and Bidirectional Cross-Modal Fusion (Bi-Fusion) modules to enable perception-driven affordance modeling and efficient edge inference. Contributions include: (1) up to a 12.1% performance gain over the best-reported non-LLM approaches and a 1.2% improvement over the authors' earlier conference version, (2) a compact 187M-parameter model, and (3) inference nearly 50× faster than the GPT-4V API. The framework is validated across multiple robotic platforms and real-world environments, demonstrating strong generalizability and practical deployability.

๐Ÿ“ Abstract
Object affordance reasoning, the ability to infer object functionalities based on physical properties, is fundamental for task-oriented planning and activities in both humans and Artificial Intelligence (AI). This capability, required for planning and executing daily activities in a task-oriented manner, relies on commonsense knowledge of object physics and functionalities, extending beyond simple object recognition. Current computational models for affordance reasoning from perception lack generalizability, limiting their applicability in novel scenarios. Meanwhile, comprehensive Large Language Models (LLMs) with emerging reasoning capabilities are challenging to deploy on local devices for task-oriented manipulations. Here, we introduce LVIS-Aff, a large-scale dataset comprising 1,496 tasks and 119k images, designed to enhance the generalizability of affordance reasoning from perception. Utilizing this dataset, we develop Afford-X, an end-to-end trainable affordance reasoning model that incorporates Verb Attention and Bi-Fusion modules to improve multi-modal understanding. This model achieves up to a 12.1% performance improvement over the best-reported results from non-LLM methods, while also demonstrating a 1.2% enhancement compared to our previous conference paper. Additionally, it maintains a compact 187M parameter size and infers nearly 50 times faster than the GPT-4V API. Our work demonstrates the potential for efficient, generalizable affordance reasoning models that can be deployed on local devices for task-oriented manipulations. We showcase Afford-X's effectiveness in enabling task-oriented manipulations for robots across various tasks and environments, underscoring its efficiency and broad implications for advancing robotics and AI systems in real-world applications.
Problem

Research questions and friction points this paper is trying to address.

Enhancing generalizability of affordance reasoning from perception.
Developing efficient models for task-oriented manipulation on local devices.
Improving multi-modal understanding in affordance reasoning models.
Innovation

Methods, ideas, or system contributions that make the work stand out.

LVIS-Aff dataset enhances affordance reasoning generalizability.
Afford-X model uses Verb Attention and Bi-Fusion modules.
Compact 187M parameter size enables fast local deployment.
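The paper does not spell out the Bi-Fusion module's internals in this summary, but bidirectional cross-modal fusion is commonly built from two cross-attention passes: vision tokens attend over text tokens and vice versa, each with a residual connection. The sketch below is a minimal illustration of that general pattern, not the authors' actual architecture; all names and dimensions are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys_values, d):
    """Single-head cross-attention: each query token attends over keys_values."""
    scores = queries @ keys_values.T / np.sqrt(d)   # (n_q, n_kv)
    return softmax(scores, axis=-1) @ keys_values   # (n_q, d)

def bi_fusion(vision_feats, text_feats):
    """Illustrative bidirectional fusion: each modality attends to the
    other, and the result is added back via a residual connection."""
    d = vision_feats.shape[-1]
    fused_vision = vision_feats + cross_attend(vision_feats, text_feats, d)
    fused_text = text_feats + cross_attend(text_feats, vision_feats, d)
    return fused_vision, fused_text

# toy example: 4 vision tokens and 3 text tokens, 8-dim features
rng = np.random.default_rng(0)
v = rng.standard_normal((4, 8))
t = rng.standard_normal((3, 8))
fv, ft = bi_fusion(v, t)
print(fv.shape, ft.shape)  # (4, 8) (3, 8)
```

In a real model each `cross_attend` would use learned query/key/value projections and multiple heads; the toy version only shows the token-routing structure that makes the fusion bidirectional.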
๐Ÿ‘ฅ Authors

Xiaomeng Zhu
Institute for Artificial Intelligence, Peking University, Beijing 100091, China; Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong 999077, China

Yuyang Li
Institute for AI, Peking University
Robotic Manipulation · Tactile Sensing · Human-Object Interaction

Leiyao Cui
University of Chinese Academy of Sciences
Computer Vision · Robotics

Pengfei Li
Institute for AI Industry Research, Tsinghua University, Beijing 100084, China

Huan-ang Gao
Ph.D. student, Tsinghua University
Agent · Vision & Robotics

Yixin Zhu
Assistant Professor, Peking University
Computer Vision · Visual Reasoning · Human-Robot Teaming

Hao Zhao
Institute for AI Industry Research, Tsinghua University, Beijing 100084, China