🤖 AI Summary
3D object affordance prediction aims to localize functionally interactive regions on object surfaces, playing a critical role in embodied intelligence and human–robot interaction. Existing approaches rely heavily on demonstration-image supervision, exhibiting limited generalization and poor support for open-vocabulary scenarios. This paper proposes the first fine-tuning-free open-world 3D affordance grounding framework. We uncover implicit, generic affordance priors embedded within frozen text-to-image diffusion models (e.g., Stable Diffusion), and design a lightweight affordance module coupled with a multi-source dense decoder to explicitly model cross-modal semantic associations via self-attention. Our method achieves significant improvements over state-of-the-art methods across multiple benchmarks—especially under cross-category and open-vocabulary settings—demonstrating both the efficacy and transferability of the structured affordance knowledge inherently encoded in diffusion models for 3D understanding.
📝 Abstract
3D object affordance grounding aims to predict the touchable regions on a 3D object, which is crucial for human–object interaction, human–robot interaction, embodied perception, and robot learning. Recent advances tackle this problem by learning from demonstration images. However, these methods fail to capture the general affordance knowledge within the image, leading to poor generalization. To address this issue, we propose to use text-to-image diffusion models to extract general affordance knowledge, because we find that such models can generate semantically valid HOI images, which demonstrates that their internal representation space is highly correlated with real-world affordance concepts. Specifically, we introduce DAG, a diffusion-based 3D affordance grounding framework, which leverages the frozen internal representations of a text-to-image diffusion model and unlocks the affordance knowledge within the diffusion model to perform 3D affordance grounding. We further introduce an affordance block and a multi-source affordance decoder to enable dense 3D affordance prediction. Extensive experimental evaluations show that our model outperforms well-established methods and exhibits open-world generalization.
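The fusion described in the abstract — per-point 3D features attending to frozen diffusion-model representations to produce a dense affordance score per point — can be sketched in a minimal, illustrative form. All names, shapes, and the single-head attention here are assumptions for exposition, not the paper's actual DAG implementation (which uses a real frozen Stable Diffusion backbone and learned decoder layers):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, d):
    """Single-head scaled dot-product attention.

    queries: (N, d) hypothetical per-point 3D features
    context: (M, d) hypothetical frozen diffusion-model tokens
    Returns (N, d) fused features: each point aggregates diffusion
    features weighted by semantic similarity.
    """
    attn = softmax(queries @ context.T / np.sqrt(d), axis=-1)  # (N, M)
    return attn @ context                                      # (N, d)

rng = np.random.default_rng(0)
d = 64
point_feats = rng.standard_normal((2048, d))     # stand-in point-cloud features
diffusion_feats = rng.standard_normal((77, d))   # stand-in frozen diffusion tokens

fused = cross_attention(point_feats, diffusion_feats, d)

# A linear head + sigmoid maps fused features to a per-point
# affordance probability, i.e. a dense heatmap over the surface.
w = rng.standard_normal((d, 1))
affordance = 1.0 / (1.0 + np.exp(-(fused @ w)))  # (2048, 1), values in (0, 1)
```

In the actual framework, the diffusion backbone stays frozen and only the lightweight attention/decoder components are trained, which is what lets the general affordance priors transfer to open-world categories.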