🤖 AI Summary
This work addresses the difficulty foundation models have in constructing, updating, and maintaining coherent spatial beliefs in partially observable environments. We propose a “Theory of Space” framework that systematically evaluates agents’ ability to build cognitive maps and form revisable spatial beliefs from sequences of local observations, using curiosity-driven active exploration tasks. A spatial belief probing mechanism, combined with cognitive mapping benchmarks, a false-belief paradigm, and comparisons with procedural agents, exposes critical limitations in current models: a performance gap between active and passive settings, inefficient exploration, unstable beliefs, and belief inertia. Our experiments show that foundation models suffer significant performance degradation during active exploration and that vision-based models have greater difficulty than text-based models in revising outdated spatial beliefs.
📝 Abstract
Spatial embodied intelligence requires agents to act to acquire information under partial observability. While multimodal foundation models excel at passive perception, their capacity for active, self-directed exploration remains understudied. We propose Theory of Space, defined as an agent's ability to acquire information through self-directed exploration and to construct, revise, and exploit a spatial belief from sequential, partial observations. We evaluate this ability with a benchmark whose goal is curiosity-driven exploration to build an accurate cognitive map. A key innovation is spatial belief probing, which prompts models to reveal their internal spatial representations at each step. Our evaluation of state-of-the-art models reveals several critical bottlenecks. First, we identify an Active-Passive Gap, where performance drops significantly when agents must autonomously gather information. Second, we find that models explore inefficiently and less systematically than program-based proxies. Third, through belief probing, we diagnose that while perception is an initial bottleneck, global beliefs are also unstable, which causes spatial knowledge to degrade over time. Finally, using a false belief paradigm, we uncover Belief Inertia, where agents fail to update obsolete priors in light of new evidence. This issue is present in text-based agents but is particularly severe in vision-based models. Our findings suggest that current foundation models struggle to maintain coherent, revisable spatial beliefs during active exploration.