🤖 AI Summary
This work investigates the limitations of pre-trained language models (PTLMs) and vision-language models (VLMs) in text-only affordance reasoning, particularly for unconventional or rare object functions. To address this, the authors propose a purely text-driven, sentence-level affordance probing framework; construct Text2Afford, an in-the-wild dataset for language grounding annotated with 15 fine-grained affordance classes; and introduce a consistency verification mechanism. They conduct a unified cross-architecture evaluation of LLMs and VLMs via prompt engineering and few-shot fine-tuning. Experiments reveal that state-of-the-art PTLMs achieve under 45% accuracy on unconventional affordances, that VLMs show no significant gain from the visual modality, and that few-shot fine-tuning yields a 22.6% average improvement, indicating that affordance knowledge can be instilled with limited supervision. The study systematically exposes structural deficits in the implicit affordance knowledge of multimodal models and establishes a new benchmark and methodology for text-driven functional understanding.
📝 Abstract
We investigate the knowledge of object affordances in pre-trained language models (PTLMs) and pre-trained vision-language models (VLMs). A growing body of literature shows that PTLMs fail inconsistently and non-intuitively, demonstrating a lack of reasoning and grounding. To take a first step toward quantifying the effect of grounding (or the lack thereof), we curate a novel and comprehensive dataset of object affordances, Text2Afford, characterized by 15 affordance classes. Unlike affordance datasets collected in the vision and language domains, we annotate in-the-wild sentences with objects and affordances. Experimental results reveal that PTLMs exhibit limited reasoning abilities when it comes to uncommon object affordances. We also observe that pre-trained VLMs do not necessarily capture object affordances effectively. Through few-shot fine-tuning, we demonstrate improved affordance knowledge in both PTLMs and VLMs. Our research contributes a novel dataset for language grounding tasks and presents insights into LM capabilities, advancing the understanding of object affordances.