Grounding by Remembering: Cross-Scene and In-Scene Memory for 3D Functional Affordances

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the precise localization of fine-grained functional regions, such as handles and buttons, in 3D scenes using training-free vision-language systems, where such regions are often small, visually ambiguous, and repeated across same-category instances. The authors propose AFFORDMEM, a framework built on a two-level memory mechanism that requires no model fine-tuning. A cross-scene, category-level memory guides a frozen vision-language model to attend to small operable subregions, while an in-scene spatial memory organizes candidate instances into a structured scene graph to resolve spatial references. Relying on a reusable RGB memory bank built from source scenes and on 3D spatial modeling, without target-scene annotation, the method improves AP50 on SceneFun3D over the prior training-free state of the art by 3.23 on Split 0 and 3.7 on Split 1. Ablation studies support the complementary benefits of the two memory mechanisms.
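To make the cross-scene memory idea concrete, the following is a minimal sketch of category-level recall, not the authors' implementation. The `MemoryEntry` type, the `recall` function, the toy embeddings, and the use of cosine similarity as the measure of "most informative" are all assumptions introduced for illustration; the summary does not state how examples are scored.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List


@dataclass
class MemoryEntry:
    """Hypothetical record for one RGB example with an affordance overlay."""
    category: str            # object category, e.g. "cabinet"
    image_id: str            # placeholder identifier for the stored image
    embedding: List[float]   # placeholder feature vector (assumption)


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recall(bank: List[MemoryEntry], category: str,
           query_embedding: List[float], k: int = 2) -> List[MemoryEntry]:
    """Return the k same-category entries most similar to the query."""
    same_category = [e for e in bank if e.category == category]
    ranked = sorted(same_category,
                    key=lambda e: cosine(e.embedding, query_embedding),
                    reverse=True)
    return ranked[:k]


# Toy memory bank; the vectors are made up for illustration.
bank = [
    MemoryEntry("cabinet", "cab_01", [1.0, 0.0]),
    MemoryEntry("cabinet", "cab_02", [0.6, 0.8]),
    MemoryEntry("cabinet", "cab_03", [0.0, 1.0]),
    MemoryEntry("lamp", "lamp_01", [1.0, 0.0]),
]
top = recall(bank, "cabinet", [0.9, 0.1], k=2)
print([e.image_id for e in top])
```

In a pipeline of this kind, the recalled images would be passed to the frozen vision-language model as visual examples alongside the query; that step is omitted here because it depends on the specific model interface.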

📝 Abstract

Functional affordance grounding requires more than recognizing an object: an agent must localize the specific region that supports an interaction, such as the handle to pull or the button to press. This is difficult for training-free vision-language pipelines because actionable regions are often small, visually ambiguous, and repeated across multiple same-category instances in a scene. We propose AFFORDMEM, a framework that grounds 3D functional affordances by remembering geometry at two levels. The first is cross-scene affordance memory: the agent maintains a category-level memory bank of RGB images with affordance regions rendered as overlays, and recalls the most informative examples at query time to guide a frozen VLM toward small operable subregions that text-only prompting consistently misses. The second is in-scene spatial memory: as the agent processes the scene, it organizes candidate instances and their 3D spatial relations into a structured scene graph, enabling the language model to resolve references over distant or currently unobserved candidates such as "the second handle from the top." AFFORDMEM requires no model fine-tuning and no target-scene annotation, using a reusable memory bank built from source scenes. On SceneFun3D, our method improves AP50 over the prior training-free state of the art by 3.23 on Split 0 and 3.7 on Split 1. Ablation studies support complementary benefits: cross-scene affordance memory improves fine-grained localization, while in-scene spatial memory provides the larger gain on spatially qualified queries. A project homepage is available.
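As an illustration of the in-scene spatial memory, here is a small sketch of how a scene graph over candidate instances could support an ordinal query like "the second handle from the top." It is a simplification under stated assumptions: the `Candidate` type, the `build_scene_graph` and `resolve_from_top` functions, the z-up coordinate convention, and the single "above" relation are all hypothetical, and a deterministic rule stands in for the language model that the abstract describes as doing the reference resolution.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Candidate:
    """Hypothetical candidate instance stored in the scene memory."""
    node_id: str
    label: str                          # e.g. "handle", "button"
    center: Tuple[float, float, float]  # (x, y, z) in metres, z up (assumption)


def build_scene_graph(candidates: List[Candidate]) -> Dict[str, List[str]]:
    """Build 'above' edges: edges[a] lists same-label nodes that a is above."""
    edges: Dict[str, List[str]] = {c.node_id: [] for c in candidates}
    for a in candidates:
        for b in candidates:
            if a.label == b.label and a.center[2] > b.center[2]:
                edges[a.node_id].append(b.node_id)
    return edges


def resolve_from_top(candidates: List[Candidate],
                     edges: Dict[str, List[str]],
                     label: str, ordinal: int) -> Optional[Candidate]:
    """Return the ordinal-th candidate of `label` counted from the top.

    A node that is above more same-label nodes ranks closer to the top.
    Returns None when the ordinal is out of range.
    """
    same_label = [c for c in candidates if c.label == label]
    ranked = sorted(same_label,
                    key=lambda c: len(edges[c.node_id]),
                    reverse=True)
    if not 1 <= ordinal <= len(ranked):
        return None
    return ranked[ordinal - 1]


# Toy scene: four drawer handles stacked vertically plus one button.
scene = [
    Candidate("handle_a", "handle", (0.5, 0.2, 0.4)),
    Candidate("handle_b", "handle", (0.5, 0.2, 0.8)),
    Candidate("handle_c", "handle", (0.5, 0.2, 1.2)),
    Candidate("handle_d", "handle", (0.5, 0.2, 1.6)),
    Candidate("button_a", "button", (1.0, 0.2, 1.0)),
]
graph = build_scene_graph(scene)
picked = resolve_from_top(scene, graph, "handle", 2)
print(picked.node_id if picked else None)
```

Because the graph holds every candidate seen so far, the query can be answered even when the referenced handle is outside the current view, which is the property the abstract attributes to the in-scene memory.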

Problem

Research questions and friction points this paper is trying to address.

functional affordance grounding

3D scene understanding

vision-language models

spatial reasoning

cross-scene memory

Innovation

Methods, ideas, or system contributions that make the work stand out.

functional affordance grounding

cross-scene memory

in-scene spatial memory