A0: An Affordance-Aware Hierarchical Model for General Robotic Manipulation

📅 2025-04-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Robotic manipulation faces a critical challenge in insufficient spatial affordance modeling—i.e., “where and how to interact”—limiting performance on complex tasks such as blackboard wiping and block stacking. To address this, we propose a hierarchical affordance-aware diffusion model. First, we introduce an embodiment-agnostic affordance representation that jointly models contact points and subsequent motion trajectories. Second, we design a position-offset attention mechanism and a spatial information aggregation layer to enhance geometric awareness and cross-scale reasoning. Third, we adopt a two-stage training paradigm: contact-point pretraining followed by trajectory fine-tuning, improving generalization across tasks and platforms. Evaluated on four robotic arms—Franka, Kinova, Realman, and Dobot—the method significantly improves success rates on complex manipulation tasks and demonstrates strong cross-platform generalization. Moreover, it supports real-time inference and deployment in realistic scenarios.
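The embodiment-agnostic affordance representation summarized above jointly models a contact point and the subsequent motion trajectory. As a rough illustration (the paper's actual tensor layout is not given here, so the class and field names below are hypothetical), it can be sketched as an object-centric data structure that the low-level action executor consumes as a waypoint sequence:

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical sketch of an embodiment-agnostic affordance:
# a 3D contact point ("where to interact") plus the waypoints of the
# post-contact motion trajectory ("how to move"). Names are illustrative,
# not the authors' implementation.
@dataclass
class Affordance:
    contact_point: Tuple[float, float, float]
    trajectory: List[Tuple[float, float, float]]

    def as_waypoints(self) -> List[Tuple[float, float, float]]:
        """Full waypoint sequence: contact point, then the post-contact motion."""
        return [self.contact_point] + self.trajectory

# Example: a wiping motion starting at a contact point and sweeping right.
aff = Affordance(
    contact_point=(0.40, 0.10, 0.02),
    trajectory=[(0.45, 0.10, 0.02), (0.50, 0.10, 0.02)],
)
print(len(aff.as_waypoints()))  # 3
```

Because the representation is defined in object-centric coordinates rather than joint space, the same affordance can in principle be mapped onto different arms (Franka, Kinova, Realman, Dobot) by each platform's own executor.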

📝 Abstract
Robotic manipulation faces critical challenges in understanding spatial affordances -- the "where" and "how" of object interactions -- essential for complex manipulation tasks like wiping a board or stacking objects. Existing methods, including modular-based and end-to-end approaches, often lack robust spatial reasoning capabilities. Unlike recent point-based and flow-based affordance methods that focus on dense spatial representations or trajectory modeling, we propose A0, a hierarchical affordance-aware diffusion model that decomposes manipulation tasks into high-level spatial affordance understanding and low-level action execution. A0 leverages the Embodiment-Agnostic Affordance Representation, which captures object-centric spatial affordances by predicting contact points and post-contact trajectories. A0 is pre-trained on a dataset of 1 million contact points and fine-tuned on annotated trajectories, enabling generalization across platforms. Key components include Position Offset Attention for motion-aware feature extraction and a Spatial Information Aggregation Layer for precise coordinate mapping. The model's output is executed by the action execution module. Experiments on multiple robotic systems (Franka, Kinova, Realman, and Dobot) demonstrate A0's superior performance in complex tasks, showcasing its efficiency, flexibility, and real-world applicability.
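The two-stage training paradigm in the abstract (contact-point pre-training, then trajectory fine-tuning) can be sketched as a simple schedule. Everything below is a hypothetical illustration: the stub model, epoch counts, and `target` labels stand in for the paper's diffusion model and losses, which are not detailed here.

```python
class StubModel:
    """Placeholder for the diffusion model; it just counts update steps."""
    def __init__(self):
        self.steps = {"contact_point": 0, "trajectory": 0}

    def step(self, sample, target):
        self.steps[target] += 1  # a real model would run a gradient update

def train(model, contact_data, trajectory_data,
          pretrain_epochs=2, finetune_epochs=1):
    # Stage 1: pre-train on contact points only
    # (the paper reports ~1M contact points for this stage).
    for _ in range(pretrain_epochs):
        for sample in contact_data:
            model.step(sample, target="contact_point")
    # Stage 2: fine-tune the same model on annotated trajectories.
    for _ in range(finetune_epochs):
        for sample in trajectory_data:
            model.step(sample, target="trajectory")

m = StubModel()
train(m, contact_data=range(3), trajectory_data=range(2))
print(m.steps)  # {'contact_point': 6, 'trajectory': 2}
```

The point of the split is that contact-point supervision is cheap and embodiment-agnostic, so the large first stage can transfer across platforms before the smaller trajectory stage specializes the model.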
Problem

Research questions and friction points this paper is trying to address.

Understanding spatial affordances for robotic manipulation tasks
Improving spatial reasoning in modular and end-to-end methods
Generalizing manipulation across different robotic platforms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical affordance-aware diffusion model
Embodiment-Agnostic Affordance Representation
Position Offset Attention feature extraction
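The page does not spell out how Position Offset Attention works internally. One plausible reading, sketched below purely as an assumption, is scaled dot-product attention whose logits are biased by pairwise position offsets, so that geometrically distant tokens attend to each other less; this is not the authors' code.

```python
import numpy as np

def position_offset_attention(q, k, v, positions):
    """Hypothetical sketch: attention logits biased by pairwise
    position offsets (one possible reading of Position Offset Attention)."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                            # (n, n) similarities
    offsets = positions[:, None, :] - positions[None, :, :]  # (n, n, 3) offsets
    logits -= np.linalg.norm(offsets, axis=-1)               # distance penalty
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)           # row-wise softmax
    return weights @ v

# Toy usage: 4 tokens with 8-dim features and 3D positions.
n, d = 4, 8
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
pos = rng.standard_normal((n, 3))
out = position_offset_attention(q, k, v, pos)
print(out.shape)  # (4, 8)
```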
👥 Authors

Rongtao Xu
MBZUAI (previously CASIA, HUST)
Intelligent Robot · Embodied AI · VLA · VLM · Spatiotemporal AI

Jian Zhang
MBZUAI

Minghao Guo
MBZUAI

Youpeng Wen
Sun Yat-sen University

Haoting Yang
Southern University of Science and Technology

Min Lin
Principal Research Scientist, Sea AI Lab
Artificial Intelligence

Jianzheng Huang
Southern University of Science and Technology

Zhe Li
Southern University of Science and Technology

Kaidong Zhang
Sun Yat-sen University

Liqiong Wang
Southern University of Science and Technology

Yuxuan Kuang
Carnegie Mellon University
Robotics · 3D Computer Vision · Machine Learning

Meng Cao
Postdoc, Carnegie Mellon University
Psychology

Feng Zheng
Southern University of Science and Technology

Xiaodan Liang
Professor of Computer Science, Sun Yat-sen University, MBZUAI, CMU, NUS
Computer Vision · Embodied AI · Machine Learning