🤖 AI Summary
This work addresses the fine-grained localization of 3D object affordances under open-vocabulary settings, tackling three key challenges: low part-level localization accuracy, semantic ambiguity across multiple affordance regions, and scarcity of large-scale annotated data. To this end, we introduce the first open-vocabulary 3D affordance benchmark, comprising 150K instances, with dense 3D heatmap annotations and explicit support for cross-domain generalization. We propose a lightweight vision-language model that integrates pretrained part-aware visual encoders (e.g., SAM, OpenCLIP) with a text-conditioned heatmap decoder, augmented by an automated pipeline for synthetic 3D scene generation and annotation. Our method achieves state-of-the-art performance across multiple 2D and 3D affordance benchmarks, demonstrating significant improvements in cross-object and cross-category generalization. The benchmark dataset, model code, and synthesis tools are publicly released.
📝 Abstract
Affordance grounding, the task of localizing object regions from natural language descriptions of interactions, is critical for enabling intelligent agents to understand and interact with their environments. However, the task remains challenging because it requires fine-grained part-level localization, must resolve the ambiguity arising from multiple valid interaction regions, and suffers from a scarcity of large-scale datasets. In this work, we introduce Affogato, a large-scale benchmark comprising 150K instances annotated with open-vocabulary text descriptions and corresponding 3D affordance heatmaps across a diverse set of objects and interactions. Building on this benchmark, we develop simple yet effective vision-language models that leverage pretrained part-aware vision backbones and a text-conditional heatmap decoder. Our models trained on the Affogato dataset achieve promising performance on existing 2D and 3D benchmarks and, notably, generalize effectively in open-vocabulary, cross-domain settings. The Affogato dataset is publicly available: https://huggingface.co/datasets/project-affogato/affogato