Affogato: Learning Open-Vocabulary Affordance Grounding with Automated Data Generation at Scale

📅 2025-06-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses fine-grained localization of 3D object affordances in open-vocabulary settings, tackling three challenges: the need for part-level localization, ambiguity arising from multiple valid interaction regions, and the scarcity of large-scale annotated data. To this end, we introduce Affogato, a large-scale open-vocabulary affordance benchmark comprising 150K instances, annotated with text descriptions and corresponding 3D affordance heatmaps and built with automated data generation at scale. We propose simple vision-language models that combine pretrained part-aware visual encoders (e.g., SAM, OpenCLIP) with a text-conditioned heatmap decoder. Models trained on Affogato achieve promising performance on existing 2D and 3D affordance benchmarks and show effective open-vocabulary cross-domain generalization. The benchmark dataset is publicly released.

📝 Abstract

Affordance grounding, the task of localizing object regions based on natural language descriptions of interactions, is a critical challenge for enabling intelligent agents to understand and interact with their environments. However, this task remains difficult due to the need for fine-grained part-level localization, the ambiguity arising from multiple valid interaction regions, and the scarcity of large-scale datasets. In this work, we introduce Affogato, a large-scale benchmark comprising 150K instances, annotated with open-vocabulary text descriptions and corresponding 3D affordance heatmaps across a diverse set of objects and interactions. Building on this benchmark, we develop simple yet effective vision-language models that leverage pretrained part-aware vision backbones and a text-conditional heatmap decoder. Our models trained on the Affogato dataset achieve promising performance on existing 2D and 3D benchmarks and, notably, are effective in open-vocabulary cross-domain generalization. The Affogato dataset is publicly available: https://huggingface.co/datasets/project-affogato/affogato

Problem

Research questions and friction points this paper is trying to address.

Localizing object regions via natural language interaction descriptions

Addressing scarcity of large-scale affordance grounding datasets

Enabling open-vocabulary cross-domain generalization in affordance tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale dataset with 150K affordance instances

Use of pretrained part-aware vision backbones

Text-conditional heatmap decoder for localization
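
To make the idea of a text-conditional heatmap decoder concrete, here is a minimal sketch in plain Python. It is not the architecture from the paper, which this page does not describe in detail. It assumes that a vision backbone has already produced one feature vector per image patch and that a text encoder has produced an embedding of the same dimension. The function names, the cosine-similarity scoring, and the temperature parameter are all illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def text_conditioned_heatmap(patch_feats, text_emb, grid_hw, temperature=0.1):
    """Score every patch against the text embedding and return an H x W heatmap.

    patch_feats: list of H*W feature vectors in row-major order (hypothetical backbone output).
    text_emb:    embedding of the interaction description (hypothetical text encoder output).
    Each cell is sigmoid(cosine / temperature), so values lie strictly between 0 and 1.
    """
    h, w = grid_hw
    if len(patch_feats) != h * w:
        raise ValueError("expected h * w patch features")
    scores = [
        1.0 / (1.0 + math.exp(-cosine(feat, text_emb) / temperature))
        for feat in patch_feats
    ]
    # Reshape the flat score list into rows of the patch grid.
    return [scores[r * w:(r + 1) * w] for r in range(h)]


# Toy example: a 2 x 2 patch grid with 2-dimensional features.
patch_feats = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
text_emb = [1.0, 0.0]
heatmap = text_conditioned_heatmap(patch_feats, text_emb, grid_hw=(2, 2))
```

In this toy example, the patch whose feature is aligned with the text embedding gets the highest score and the patch pointing the opposite way gets the lowest. A learned decoder would replace the fixed cosine score with trained layers, but the input and output shapes would look similar.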

🔎 Similar Papers

Text2Afford: Probing Object Affordance Prediction abilities of Language Models solely from Text