🤖 AI Summary
Current multimodal large language models (MLLMs) cannot construct the viewpoint-invariant cognitive maps essential for embodied navigation; this gap hinders object-constancy recognition, cross-view spatial relation reasoning, and dynamic quantity tracking. To address this, we propose REM, the first systematic benchmark for long-horizon embodied spatial reasoning. REM leverages a controllable 3D simulation environment to generate multi-frame visual trajectories and evaluates three core capabilities under a unified protocol: object persistence, spatial topological relations, and quantity consistency. Its fine-grained diagnostic metrics reveal, for the first time, a substantial performance drop (>40% on average) for mainstream MLLMs on even moderately complex tasks, exposing a fundamental limitation in maintaining stable cross-frame spatial representations. The code and dataset are publicly released.
📝 Abstract
Humans build viewpoint-independent cognitive maps through navigation, enabling intuitive reasoning about object permanence and spatial relations. We argue that multimodal large language models (MLLMs), despite extensive video training, lack this fundamental spatial reasoning capability, a critical limitation for embodied applications. To demonstrate these limitations and drive research, we introduce REM (Reasoning over Embodied Multi-Frame Trajectories), a benchmark using controllable 3D environments for long-horizon embodied spatial reasoning. REM systematically evaluates key aspects like object permanence/distinction, spatial relationships, and numerical tracking across dynamic embodied viewpoints. Our evaluation shows that the best-performing current models exhibit promising overall performance, but become increasingly unreliable at even moderate complexity levels easily handled by humans. These findings highlight challenges MLLMs face in developing robust spatial representations from sequential visual input. Consequently, REM provides targeted metrics and diagnostics to foster improved spatial understanding in future models.
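To make the evaluation setup more concrete, the sketch below scores a model on multi-frame question-answer items grouped by task type (object permanence, spatial relations, counting). This is a minimal hypothetical harness, not REM's released code: the item fields (`task`, `frames`, `question`, `answer`) and the `model.answer(frames, question)` interface are assumptions for illustration.

```python
import json
from collections import defaultdict


def evaluate(items, model):
    """Score a model on multi-frame QA items and report per-task accuracy.

    Each item is assumed to bundle an ordered frame sequence from one embodied
    trajectory with a single question and a ground-truth answer string.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        task = item["task"]        # e.g. "object_persistence", "spatial_relation", "counting"
        frames = item["frames"]    # ordered frame paths along one embodied trajectory
        prediction = model.answer(frames, item["question"])
        total[task] += 1
        if prediction.strip().lower() == item["answer"].strip().lower():
            correct[task] += 1
    return {task: correct[task] / n for task, n in total.items()}


if __name__ == "__main__":
    # Hypothetical usage: "rem_items.json" and the model wrapper are placeholders;
    # adapt model.answer(frames, question) -> str to your MLLM's API.
    with open("rem_items.json") as f:
        items = json.load(f)
    # accuracies = evaluate(items, my_model)
```

Exact-match scoring keeps the sketch simple; a real harness for free-form MLLM outputs would typically normalize answers or restrict questions to multiple-choice options.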