Affogato: Learning Open-Vocabulary Affordance Grounding with Automated Data Generation at Scale

📅 2025-06-13
📈 Citations: 0 (Influential: 0)
🤖 AI Summary
This work addresses open-vocabulary affordance grounding: localizing the object regions that a natural-language description of an interaction refers to. It targets three challenges: the need for fine-grained part-level localization, the ambiguity that arises when multiple interaction regions are valid, and the scarcity of large-scale annotated data. To this end, we introduce Affogato, a benchmark of 150K instances annotated with open-vocabulary text descriptions and corresponding 3D affordance heatmaps, built with automated data generation. On top of this benchmark, we develop simple vision-language models that combine pretrained part-aware vision backbones with a text-conditional heatmap decoder. Models trained on Affogato achieve promising performance on existing 2D and 3D affordance benchmarks and are effective in open-vocabulary cross-domain generalization. The dataset is publicly released.

📝 Abstract
Affordance grounding, the task of localizing object regions based on natural language descriptions of interactions, is a critical step toward enabling intelligent agents to understand and interact with their environments. However, the task remains difficult because it requires fine-grained part-level localization, admits multiple valid interaction regions, and lacks large-scale datasets. In this work, we introduce Affogato, a large-scale benchmark comprising 150K instances, annotated with open-vocabulary text descriptions and corresponding 3D affordance heatmaps across a diverse set of objects and interactions. Building on this benchmark, we develop simple yet effective vision-language models that leverage pretrained part-aware vision backbones and a text-conditional heatmap decoder. Our models trained on the Affogato dataset achieve promising performance on existing 2D and 3D benchmarks and, notably, are effective in open-vocabulary cross-domain generalization. The Affogato dataset is publicly available: https://huggingface.co/datasets/project-affogato/affogato
Problem

Research questions and friction points this paper is trying to address.

Localizing object regions via natural language interaction descriptions
Addressing scarcity of large-scale affordance grounding datasets
Enabling open-vocabulary cross-domain generalization in affordance tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale dataset with 150K affordance instances
Use of pretrained part-aware vision backbones
Text-conditional heatmap decoder for localization
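
The idea of a text-conditional heatmap decoder can be illustrated with a minimal sketch. This is not the authors' architecture: it assumes each 3D point already has a feature vector from some vision backbone and that the text query has been embedded into the same space, and it substitutes a plain cosine-similarity score followed by a sigmoid for the learned decoder. The function names, the `temperature` parameter, and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; returns 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def text_conditioned_heatmap(
    point_features: Sequence[Sequence[float]],
    text_embedding: Sequence[float],
    temperature: float = 0.1,  # hypothetical sharpening factor, not from the paper
) -> List[float]:
    """Score every point against the text embedding and squash to (0, 1).

    A stand-in for a learned decoder: similarity -> scaled logit -> sigmoid.
    """
    heatmap = []
    for feature in point_features:
        logit = cosine(feature, text_embedding) / temperature
        heatmap.append(1.0 / (1.0 + math.exp(-logit)))
    return heatmap


# Toy example: three points with 2-D features and one text query embedding.
toy_features = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
toy_text = [1.0, 0.0]
toy_heatmap = text_conditioned_heatmap(toy_features, toy_text)
```

In this toy case the point whose feature aligns with the query receives a value near 1, the orthogonal point receives exactly 0.5, and the opposite point receives a value near 0. Because the output is a soft per-point score and not a single hard mask, several regions can be highlighted at once, which is one way to accommodate multiple valid interaction regions.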
Authors

Junha Lee, POSTECH; interests: Computer Vision
Eunha Park, Pohang University of Science and Technology (POSTECH)
Chunghyun Park, POSTECH; interests: Computer Vision, Machine Learning, 3D Vision
Dahyun Kang, Pohang University of Science and Technology (POSTECH)
Minsu Cho, Pohang University of Science and Technology (POSTECH), RLWRLD