🤖 AI Summary
Current multimodal large language models (MLLMs) cannot construct the viewpoint-invariant cognitive maps essential for embodied navigation; this gap hinders object-constancy recognition, cross-view spatial relation reasoning, and dynamic quantity tracking. To address this, we propose REM, the first systematic benchmark for long-horizon embodied spatial reasoning. REM leverages a controllable 3D simulation environment to generate multi-frame visual trajectories and evaluates three core capabilities under a unified protocol: object persistence, spatial topological relations, and quantity consistency. Its fine-grained diagnostic metrics reveal, for the first time, a substantial performance drop (>40% on average) for mainstream MLLMs on even moderately complex tasks, exposing a fundamental limitation in maintaining stable cross-frame spatial representations. The code and dataset are publicly released.
📝 Abstract
Humans build viewpoint-independent cognitive maps through navigation, enabling intuitive reasoning about object permanence and spatial relations. We argue that multimodal large language models (MLLMs), despite extensive video training, lack this fundamental spatial reasoning capability, a critical limitation for embodied applications. To demonstrate these limitations and drive research, we introduce REM (Reasoning over Embodied Multi-Frame Trajectories), a benchmark using controllable 3D environments for long-horizon embodied spatial reasoning. REM systematically evaluates key aspects like object permanence/distinction, spatial relationships, and numerical tracking across dynamic embodied viewpoints. Our evaluation shows that the best-performing current models exhibit promising overall performance, but become increasingly unreliable at even moderate complexity levels easily handled by humans. These findings highlight challenges MLLMs face in developing robust spatial representations from sequential visual input. Consequently, REM provides targeted metrics and diagnostics to foster improved spatial understanding in future models.
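To make the evaluation setup more concrete, the sketch below scores a model on multi-frame question-answer items grouped by task type (object permanence, spatial relations, counting). This is a minimal hypothetical harness, not REM's released code: the item fields (`task`, `frames`, `question`, `answer`) and the `model.answer(frames, question)` interface are assumptions for illustration.

```python
import json
from collections import defaultdict


def evaluate(items, model):
    """Score a model on multi-frame QA items and report per-task accuracy.

    Each item is assumed to bundle an ordered frame sequence from one embodied
    trajectory with a single question and a ground-truth answer string.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        task = item["task"]        # e.g. "object_persistence", "spatial_relation", "counting"
        frames = item["frames"]    # ordered frame paths along one embodied trajectory
        prediction = model.answer(frames, item["question"])
        total[task] += 1
        if prediction.strip().lower() == item["answer"].strip().lower():
            correct[task] += 1
    return {task: correct[task] / n for task, n in total.items()}


if __name__ == "__main__":
    # Hypothetical usage: "rem_items.json" and the model wrapper are placeholders;
    # adapt model.answer(frames, question) -> str to your MLLM's API.
    with open("rem_items.json") as f:
        items = json.load(f)
    # accuracies = evaluate(items, my_model)
```

Exact-match scoring keeps the sketch simple; a real harness for free-form MLLM outputs would typically normalize answers or restrict questions to multiple-choice options.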