SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs

📅 2026-05-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video multimodal large language models (MLLMs) lack a persistent, world-centered spatial representation, which limits consistent 3D spatial reasoning. Inspired by the mammalian dual-stream system, the authors propose SpaceMind++, a video MLLM that builds a voxelized cognitive map from RGB video and unifies fragmented egocentric observations into a shared 3D metric representation in an allocentric coordinate frame. A Coordinate-Guided Deep Iterative Fusion mechanism, guided by coordinate embeddings and 3D Rotary Positional Encoding, relays map-level spatial knowledge back into the model's original 2D visual features. The approach preserves object permanence and spatial topology across changing viewpoints, reports a new state of the art on VSI-Bench, and shows strong out-of-distribution generalization on SPBench, SITE-Bench, and SPAR-Bench.
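To make the idea of unifying egocentric observations in an allocentric frame concrete, the following is a minimal sketch, not the authors' implementation. It assumes per-pixel metric depth, pinhole intrinsics, and camera-to-world poses are available (the abstract does not say how SpaceMind++ obtains them), and every function name, the mean-pooling rule, and the voxel size are hypothetical choices made for illustration.

```python
import math

def backproject(u, v, depth, fx, fy, cx, cy):
    """Pixel (u, v) with metric depth -> 3D point in the camera (egocentric) frame."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def cam_to_world(p_cam, R, t):
    """Rigid camera-to-world transform: p_world = R @ p_cam + t."""
    return tuple(
        sum(R[i][j] * p_cam[j] for j in range(3)) + t[i] for i in range(3)
    )

def voxel_index(p_world, voxel_size):
    """Quantize a world-frame point to an integer voxel coordinate."""
    return tuple(math.floor(c / voxel_size) for c in p_world)

def build_voxel_map(frames, intrinsics, voxel_size=0.25):
    """Mean-pool per-pixel features into world-frame voxels.

    frames: list of dicts with
        'pose'   -> (R, t), an assumed camera-to-world rotation and translation
        'pixels' -> list of (u, v, depth, feature_list)
    Returns {voxel_index: mean_feature}.
    """
    fx, fy, cx, cy = intrinsics
    sums, counts = {}, {}
    for frame in frames:
        R, t = frame["pose"]
        for u, v, depth, feat in frame["pixels"]:
            p_world = cam_to_world(backproject(u, v, depth, fx, fy, cx, cy), R, t)
            idx = voxel_index(p_world, voxel_size)
            if idx not in sums:
                sums[idx] = [0.0] * len(feat)
                counts[idx] = 0
            for k, f in enumerate(feat):
                sums[idx][k] += f
            counts[idx] += 1
    return {idx: [s / counts[idx] for s in sums[idx]] for idx in sums}
```

In this sketch, two frames that see the same physical point from different camera positions write into the same voxel, which is the property that lets a world-centered map keep track of an object after the viewpoint changes.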

📝 Abstract

Recent multimodal large language models (MLLMs) have made remarkable progress in visual understanding and language-based reasoning, yet they lack a persistent world-centered representation for spatially consistent reasoning in 3D environments. Inspired by the mammalian dual-stream system, where semantic and spatial cues are processed separately and integrated into an allocentric cognitive map, we propose SpaceMind++, a video MLLM architecture that explicitly builds a voxelized cognitive map from RGB videos. This map reorganizes fragmented egocentric observations into a shared 3D metric representation, enabling the model to preserve object permanence and spatial topology across changing viewpoints. To make this allocentric representation usable by a pretrained video MLLM without disrupting its native visual-token interface, we introduce Coordinate-Guided Deep Iterative Fusion, a new mechanism that relays map-level spatial knowledge back into the original 2D visual features. This fusion is explicitly guided by coordinate embeddings and 3D Rotary Positional Encoding, which ground semantic interactions in metric 3D space, resembling the entorhinal binding of sensory features to metric space. Extensive experiments show that SpaceMind++ achieves new state-of-the-art performance on VSI-Bench. Furthermore, it demonstrates superior out-of-distribution generalization on SPBench, SITE-Bench, and SPAR-Bench, underscoring its robustness in unseen 3D environments.
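The abstract does not spell out how its 3D Rotary Positional Encoding is formulated. As a hedged illustration only, the sketch below uses one common axis-split variant of rotary encoding: the feature vector is divided into three chunks, and each chunk is rotated by angles proportional to the x, y, or z coordinate. The function name `rope_3d`, the chunking scheme, and the frequency base are assumptions, not details taken from the paper.

```python
import math

def rope_3d(vec, coord, base=10000.0):
    """Axis-split rotary encoding (illustrative variant, not the paper's).

    vec:   feature vector whose length is a multiple of 6
    coord: (x, y, z) metric position associated with the feature
    Each third of the vector is rotated pairwise by angles
    coord[axis] * base ** (-i / n_pairs).
    """
    d = len(vec)
    if d % 6 != 0:
        raise ValueError("feature length must be a multiple of 6")
    chunk = d // 3
    n_pairs = chunk // 2
    out = []
    for axis in range(3):
        part = vec[axis * chunk:(axis + 1) * chunk]
        for i in range(n_pairs):
            theta = coord[axis] * base ** (-i / n_pairs)
            c, s = math.cos(theta), math.sin(theta)
            a, b = part[2 * i], part[2 * i + 1]
            # Standard 2D rotation of the (a, b) pair by theta.
            out.extend([a * c - b * s, a * s + b * c])
    return out

def dot(a, b):
    """Plain inner product, as used for an attention score."""
    return sum(x * y for x, y in zip(a, b))
```

The useful property of this kind of encoding is that the inner product between a rotated query and a rotated key depends only on the relative 3D offset between their positions, so shifting both positions by the same amount leaves the attention score unchanged. That is one way attention over visual tokens can be grounded in metric 3D space.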

Problem

Research questions and friction points this paper is trying to address.

allocentric cognitive maps

spatial reasoning

3D environments

video MLLMs

object permanence

Innovation

Methods, ideas, or system contributions that make the work stand out.

allocentric cognitive map

voxelized representation

Coordinate-Guided Deep Iterative Fusion