DAG: Unleash the Potential of Diffusion Model for Open-Vocabulary 3D Affordance Grounding

📅 2025-08-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
3D object affordance prediction aims to localize functionally interactive regions on object surfaces, playing a critical role in embodied intelligence and human–robot interaction. Existing approaches rely heavily on demonstration-image supervision, exhibiting limited generalization and poor support for open-vocabulary scenarios. This paper proposes the first fine-tuning-free open-world 3D affordance grounding framework. It identifies implicit, generic affordance priors embedded within frozen text-to-image diffusion models (e.g., Stable Diffusion), and designs a lightweight affordance block coupled with a multi-source affordance decoder to explicitly model cross-modal semantic associations via self-attention. The method achieves significant improvements over state-of-the-art methods across multiple benchmarks, especially under cross-category and open-vocabulary settings, demonstrating both the efficacy and transferability of the structured affordance knowledge inherently encoded in diffusion models for 3D understanding.

📝 Abstract
3D object affordance grounding aims to predict the touchable regions on a 3D object, which is crucial for human–object interaction, human–robot interaction, embodied perception, and robot learning. Recent advances tackle this problem by learning from demonstration images. However, these methods fail to capture the general affordance knowledge within the image, leading to poor generalization. To address this issue, we propose to use text-to-image diffusion models to extract general affordance knowledge, because we find that such models can generate semantically valid HOI images, which demonstrates that their internal representation space is highly correlated with real-world affordance concepts. Specifically, we introduce DAG, a diffusion-based 3D affordance grounding framework, which leverages the frozen internal representations of the text-to-image diffusion model and unlocks affordance knowledge within the diffusion model to perform 3D affordance grounding. We further introduce an affordance block and a multi-source affordance decoder to enable dense 3D affordance prediction. Extensive experimental evaluations show that our model outperforms well-established methods and exhibits open-world generalization.
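
The abstract describes harvesting frozen internal representations from a text-to-image diffusion model, but this listing gives no implementation details. Below is a minimal sketch of one common way to extract such features with the diffusers library; the checkpoint, the mid-block tap point, the timestep, and the `extract_features` helper are all assumptions, not the paper's released method.

```python
import torch
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load a frozen Stable Diffusion pipeline; the checkpoint is an assumption.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5"
).to(device)
pipe.unet.requires_grad_(False)
pipe.vae.requires_grad_(False)

captured = {}

def save_output(name):
    def hook(module, inputs, output):
        captured[name] = output  # UNet feature map of shape (B, C, h, w)
    return hook

# Tap the UNet mid-block; the paper's actual tap points are not specified here.
pipe.unet.mid_block.register_forward_hook(save_output("mid"))

@torch.no_grad()
def extract_features(image, prompt, t=261):
    """image: (1, 3, 512, 512) in [-1, 1]; t: noise timestep (an assumption)."""
    latents = pipe.vae.encode(image.to(device)).latent_dist.sample()
    latents = latents * pipe.vae.config.scaling_factor
    timestep = torch.tensor([t], device=device)
    noisy = pipe.scheduler.add_noise(latents, torch.randn_like(latents), timestep)
    tokens = pipe.tokenizer(
        prompt, padding="max_length",
        max_length=pipe.tokenizer.model_max_length, return_tensors="pt",
    ).input_ids.to(device)
    text_emb = pipe.text_encoder(tokens)[0]
    pipe.unet(noisy, timestep, encoder_hidden_states=text_emb)  # fills `captured`
    return captured["mid"]
```

A single noised forward pass like this (rather than full sampling) is the standard trick for reading out diffusion features, since the hooked activations already carry the text-conditioned semantics.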
Problem

Research questions and friction points this paper is trying to address.

Improve generalization in 3D affordance grounding
Leverage diffusion models for affordance knowledge
Enable open-vocabulary 3D affordance prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages text-to-image diffusion models
Introduces an affordance block and multi-source affordance decoder (sketched below)
Unlocks affordance knowledge within frozen diffusion models
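
The affordance block and multi-source affordance decoder are named but not specified in this listing. The following is a hypothetical PyTorch sketch of what such a design could look like: point tokens and flattened diffusion-feature tokens are concatenated and processed by self-attention (matching the summary's cross-modal associations via self-attention), and a small head produces a dense per-point affordance score. All module names, dimensions, and the number of sources are assumptions.

```python
import torch
import torch.nn as nn

class AffordanceBlock(nn.Module):
    """Joint self-attention over point tokens and diffusion-feature tokens
    (a hypothetical design; the paper's exact block may differ)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, points, context):
        # points: (B, N, dim) per-point tokens; context: (B, M, dim)
        # flattened 2D diffusion features from one source.
        n = points.shape[1]
        x = torch.cat([points, context], dim=1)   # one joint token sequence
        fused, _ = self.attn(x, x, x)             # cross-modal self-attention
        x = self.norm(x + fused)
        x = x + self.ffn(x)
        return x[:, :n]                           # keep the updated point tokens

class MultiSourceAffordanceDecoder(nn.Module):
    """Fuses several feature sources in sequence, then predicts a dense
    per-point affordance score in [0, 1]."""
    def __init__(self, dim=256, num_sources=3):
        super().__init__()
        self.blocks = nn.ModuleList(
            AffordanceBlock(dim) for _ in range(num_sources)
        )
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, points, sources):
        for block, context in zip(self.blocks, sources):
            points = block(points, context)
        return torch.sigmoid(self.head(points)).squeeze(-1)  # (B, N)

# Toy usage: 2048 points and three feature sources of different lengths.
decoder = MultiSourceAffordanceDecoder()
points = torch.randn(2, 2048, 256)
sources = [torch.randn(2, m, 256) for m in (77, 256, 1024)]
heatmap = decoder(points, sources)   # per-point affordance scores, shape (2, 2048)
```

Fusing one source per block keeps each attention pass small while still letting every point token see all feature sources by the end of the stack.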
👥 Authors
Hanqing Wang, The Hong Kong University of Science and Technology (GZ)
Zhenhao Zhang, ShanghaiTech University
Kaiyang Ji, ShanghaiTech University
Mingyu Liu, Technical University of Munich
Wenti Yin, Huazhong University of Science and Technology
Yuchao Chen, Huazhong University of Science and Technology
Zhirui Liu, ShanghaiTech University
Xiangyu Zeng, Shanghai AI Lab, Nanjing University
Tianxiang Gui, ShanghaiTech University
Hangxing Zhang, ShanghaiTech University