🤖 AI Summary
Robotic manipulation faces a critical challenge in insufficient spatial affordance modeling—i.e., “where and how to interact”—limiting performance on complex tasks such as blackboard wiping and block stacking. To address this, we propose a hierarchical affordance-aware diffusion model. First, we introduce an embodiment-agnostic affordance representation that jointly models contact points and subsequent motion trajectories. Second, we design a position-offset attention mechanism and a spatial information aggregation layer to enhance geometric awareness and cross-scale reasoning. Third, we adopt a two-stage training paradigm: contact-point pretraining followed by trajectory fine-tuning, improving generalization across tasks and platforms. Evaluated on four robotic arms—Franka, Kinova, Realman, and Dobot—the method significantly improves success rates on complex manipulation tasks and demonstrates strong cross-platform generalization. Moreover, it supports real-time inference and deployment in realistic scenarios.
📝 Abstract
Robotic manipulation faces critical challenges in understanding spatial affordances -- the "where" and "how" of object interactions -- essential for complex manipulation tasks like wiping a board or stacking objects. Existing methods, both modular and end-to-end, often lack robust spatial reasoning capabilities. Unlike recent point-based and flow-based affordance methods that focus on dense spatial representations or trajectory modeling, we propose A0, a hierarchical affordance-aware diffusion model that decomposes manipulation tasks into high-level spatial affordance understanding and low-level action execution. A0 leverages an Embodiment-Agnostic Affordance Representation, which captures object-centric spatial affordances by predicting contact points and post-contact trajectories. A0 is pre-trained on 1 million contact-point samples and fine-tuned on annotated trajectories, enabling generalization across platforms. Key components include Position Offset Attention for motion-aware feature extraction and a Spatial Information Aggregation Layer for precise coordinate mapping. The model's output is then executed by a downstream action execution module. Experiments on multiple robotic systems (Franka, Kinova, Realman, and Dobot) demonstrate A0's superior performance on complex tasks, showcasing its efficiency, flexibility, and real-world applicability.
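To make the affordance representation concrete, here is a minimal, illustrative sketch of an object-centric affordance as a contact point plus post-contact waypoints, with per-step offsets of the kind a position-offset attention layer could attend over. All names, shapes, and the use of normalized 2D image coordinates are assumptions for illustration, not the paper's actual interface.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Affordance2D:
    """Object-centric affordance: a contact point plus post-contact waypoints.

    Hypothetical sketch -- names and shapes are assumptions, not A0's API.
    Coordinates are normalized image coordinates in [0, 1], which keeps the
    representation independent of any particular robot embodiment.
    """

    contact: np.ndarray    # shape (2,): (u, v) contact point
    waypoints: np.ndarray  # shape (T, 2): post-contact trajectory

    def offsets(self) -> np.ndarray:
        """Displacement of each waypoint relative to the contact point,
        i.e., a motion cue derived from the post-contact trajectory."""
        return self.waypoints - self.contact


# Usage: a wiping motion that starts at the contact point and sweeps right.
aff = Affordance2D(
    contact=np.array([0.40, 0.55]),
    waypoints=np.array([[0.45, 0.55], [0.50, 0.55], [0.55, 0.55]]),
)
print(aff.offsets())
```

Because both the contact point and the offsets live in normalized image coordinates, the same prediction can be mapped to different arms (Franka, Kinova, Realman, Dobot) by each platform's own execution module.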