🤖 AI Summary
This work addresses the difficulty existing embodied agents have with viewpoint-dependent geometric reasoning (such as line-of-sight, occlusion, and reachability), which stems from inadequate spatial memory systems. The authors propose RenderMem, a spatial memory framework that employs rendering as the interface between 3D scene representations and spatial reasoning. By generating visual evidence on demand from viewpoints implied by the query, the method enables geometrically consistent reasoning from arbitrary, dynamically chosen viewpoints. The resulting viewpoint-conditioned renderings can be consumed directly by off-the-shelf vision-language models without architectural modification. Evaluated in the AI2-THOR environment, the approach consistently outperforms prior memory baselines, achieving higher accuracy on viewpoint-dependent visibility and occlusion question-answering tasks.
📝 Abstract
Embodied reasoning is inherently viewpoint-dependent: what is visible, occluded, or reachable depends critically on where the agent stands. However, existing spatial memory systems for embodied agents typically store either multi-view observations or object-centric abstractions, making it difficult to perform reasoning with explicit geometric grounding. We introduce RenderMem, a spatial memory framework that treats rendering as the interface between 3D world representations and spatial reasoning. Instead of storing fixed observations, RenderMem maintains a 3D scene representation and generates query-conditioned visual evidence by rendering the scene from viewpoints implied by the query. This enables embodied agents to reason directly about line-of-sight, visibility, and occlusion from arbitrary perspectives. RenderMem is fully compatible with existing vision-language models and requires no modification to standard architectures. Experiments in the AI2-THOR environment show consistent improvements on viewpoint-dependent visibility and occlusion queries over prior memory baselines.
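The abstract describes a query loop in which the memory holds a 3D scene representation, each query implies a viewpoint, and the scene is rendered from that viewpoint to produce the evidence an agent reasons over. The following is a minimal sketch of that idea, not the authors' implementation: the scene is reduced to a 2D occupancy grid, "rendering" is a line-of-sight ray march, and all names (`Scene`, `render_visible`, etc.) are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scene:
    """Toy stand-in for RenderMem's 3D scene representation."""
    obstacles: frozenset  # blocked (x, y) grid cells
    objects: dict         # object name -> (x, y) position

def line_of_sight(scene, start, end, steps=100):
    """True if the segment start->end crosses no obstacle cell."""
    (x0, y0), (x1, y1) = start, end
    for i in range(1, steps):
        t = i / steps
        cell = (round(x0 + t * (x1 - x0)), round(y0 + t * (y1 - y0)))
        if cell in scene.obstacles and cell != end:
            return False
    return True

def render_visible(scene, viewpoint):
    """'Render' from a query-implied viewpoint: the set of objects
    with an unobstructed line of sight to the viewpoint."""
    return {name for name, pos in scene.objects.items()
            if line_of_sight(scene, viewpoint, pos)}

scene = Scene(
    obstacles=frozenset({(2, 0)}),              # a single wall cell
    objects={"mug": (4, 0), "lamp": (0, 3)},
)
# The same scene yields different answers from different viewpoints:
print(render_visible(scene, (0, 0)))  # the wall occludes the mug
print(render_visible(scene, (4, 2)))  # from here the mug is visible
```

In the actual framework, the rendered evidence would be an image passed to a vision-language model rather than a set of names; the point of the sketch is only that the answer to "is X visible?" is recomputed per viewpoint instead of being read from stored observations.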