🤖 AI Summary
This work addresses the challenge of precisely localizing fine-grained functional regions, such as handles and buttons, in 3D scenes with training-free vision-language pipelines, where such regions are often small, visually ambiguous, and repeated across same-category instances. The authors propose AFFORDMEM, a framework built on a two-level memory mechanism that requires no model fine-tuning. A cross-scene, category-level memory guides a frozen vision-language model to attend to small operable sub-regions, while an in-scene spatial memory uses a structured scene graph to resolve spatial references. Relying on a reusable RGB memory bank built from source scenes and on 3D spatial modeling, without any annotation of or training on target scenes, the method improves AP50 on SceneFun3D over the prior training-free state of the art by 3.23 points on Split 0 and 3.7 points on Split 1. Ablation studies support complementary benefits from the two memory mechanisms.
📝 Abstract
Functional affordance grounding requires more than recognizing an object: an agent must localize the specific region that supports an interaction, such as the handle to pull or the button to press. This is difficult for training-free vision-language pipelines because actionable regions are often small, visually ambiguous, and repeated across multiple same-category instances in a scene. We propose AFFORDMEM, a framework that grounds 3D functional affordances by remembering geometry at two levels. The first is cross-scene affordance memory: the agent maintains a category-level memory bank of RGB images with affordance regions rendered as overlays, and recalls the most informative examples at query time to guide a frozen VLM toward small operable subregions that text-only prompting consistently misses. The second is in-scene spatial memory: as the agent processes the scene, it organizes candidate instances and their 3D spatial relations into a structured scene graph, enabling the language model to resolve references over distant or currently unobserved candidates such as "the second handle from the top." AFFORDMEM requires no model fine-tuning and no target-scene annotation, using a reusable memory bank built from source scenes. On SceneFun3D, our method improves AP50 over the prior training-free state of the art by 3.23 points on Split 0 and 3.7 points on Split 1. Ablation studies support complementary benefits: cross-scene affordance memory improves fine-grained localization, while in-scene spatial memory provides the larger gain on spatially qualified queries. A project page is available.
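To make the in-scene spatial memory concrete, the following is a minimal sketch of how a reference such as "the second handle from the top" could be resolved once candidates and their 3D positions are stored. It is not the AFFORDMEM implementation: the `Candidate` record, the `build_vertical_relations` and `resolve_ordinal_from_top` helpers, the z-up coordinate convention, and the sample centroids are all assumptions made for illustration, and in the paper the resolution is performed by a language model reading the scene graph rather than by a hand-written rule.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate instance: a category label and a 3D centroid (z is up)."""
    name: str
    category: str
    centroid: Tuple[float, float, float]


def build_vertical_relations(candidates: List[Candidate]) -> List[Tuple[str, str, str]]:
    """Return (upper, "above", lower) edges between same-category candidates."""
    edges = []
    for a in candidates:
        for b in candidates:
            if a.category == b.category and a.centroid[2] > b.centroid[2]:
                edges.append((a.name, "above", b.name))
    return edges


def resolve_ordinal_from_top(
    candidates: List[Candidate], category: str, ordinal: int
) -> Candidate:
    """Pick the ordinal-th (1-based) candidate of a category, counting down from the top."""
    same_category = [c for c in candidates if c.category == category]
    ranked = sorted(same_category, key=lambda c: c.centroid[2], reverse=True)
    if not 1 <= ordinal <= len(ranked):
        raise ValueError(f"no candidate #{ordinal} for category {category!r}")
    return ranked[ordinal - 1]


# Illustrative scene: three drawer handles stacked vertically, plus one button.
scene = [
    Candidate("handle_a", "handle", (0.4, 0.1, 0.35)),
    Candidate("handle_b", "handle", (0.4, 0.1, 0.95)),
    Candidate("handle_c", "handle", (0.4, 0.1, 0.65)),
    Candidate("button_a", "button", (1.2, 0.0, 1.10)),
]

relations = build_vertical_relations(scene)
second_from_top = resolve_ordinal_from_top(scene, "handle", 2)
print(second_from_top.name)
```

Because the ranking runs over every stored candidate rather than over what is visible in the current frame, the same lookup works for instances that are distant or currently unobserved, which is the property the abstract attributes to the scene graph.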