🤖 AI Summary
Current multimodal large language models (MLLMs) deployed in embodied agents lack structured spatial memory and behave reactively, which limits their generalization and adaptability in complex real-world environments. To address this, the authors propose BSC-Nav, a brain-inspired unified navigation framework that constructs allocentric cognitive maps from egocentric trajectories and contextual cues, consolidating landmark-, route-, and survey-level spatial knowledge. At inference time, BSC-Nav dynamically retrieves spatial knowledge aligned with semantic goals and integrates it with powerful MLLMs for visual-language spatial reasoning. Across diverse navigation benchmarks, BSC-Nav achieves state-of-the-art efficacy and efficiency, demonstrates strong zero-shot generalization, and supports versatile goal-directed behaviors in the real physical world, establishing a scalable, biologically grounded paradigm for embodied spatial intelligence.
📝 Abstract
Spatial cognition enables adaptive goal-directed behavior by constructing internal models of space. Robust biological systems consolidate spatial knowledge into three interconnected forms: *landmarks* for salient cues, *route knowledge* for movement trajectories, and *survey knowledge* for map-like representations. While recent advances in multimodal large language models (MLLMs) have enabled visual-language reasoning in embodied agents, these efforts lack structured spatial memory and instead operate reactively, limiting their generalization and adaptability in complex real-world environments. Here we present Brain-inspired Spatial Cognition for Navigation (BSC-Nav), a unified framework for constructing and leveraging structured spatial memory in embodied agents. BSC-Nav builds allocentric cognitive maps from egocentric trajectories and contextual cues, and dynamically retrieves spatial knowledge aligned with semantic goals. Integrated with powerful MLLMs, BSC-Nav achieves state-of-the-art efficacy and efficiency across diverse navigation tasks, demonstrates strong zero-shot generalization, and supports versatile embodied behaviors in the real physical world, offering a scalable and biologically grounded path toward general-purpose spatial intelligence.
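To make the three forms of spatial knowledge concrete, the following is a minimal, hypothetical sketch (not the paper's implementation) of an allocentric memory that stores landmark, route, and survey entries and retrieves the landmark best aligned with a semantic goal embedding. The `CognitiveMap` class, its methods, and the toy 2-D embeddings are all illustrative assumptions; a real system would use learned visual-language embeddings and a richer map representation.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors (illustrative stand-in
    # for semantic alignment between a goal and stored landmarks).
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class CognitiveMap:
    """Toy allocentric spatial memory with the three knowledge forms."""

    def __init__(self):
        self.landmarks = []  # (name, embedding, (x, y)): salient cues
        self.routes = []     # lists of waypoints: movement trajectories
        self.survey = {}     # name -> (x, y): map-like global layout

    def add_landmark(self, name, embedding, position):
        self.landmarks.append((name, embedding, position))
        self.survey[name] = position  # consolidate into survey knowledge

    def add_route(self, waypoints):
        self.routes.append(list(waypoints))

    def retrieve(self, goal_embedding):
        """Return the landmark (name, allocentric position) best matching the goal."""
        best = max(self.landmarks, key=lambda lm: cosine(lm[1], goal_embedding))
        return best[0], self.survey[best[0]]

# Usage: store two landmarks, then retrieve by a semantic goal vector.
m = CognitiveMap()
m.add_landmark("sofa", [1.0, 0.0], (2, 3))
m.add_landmark("fridge", [0.0, 1.0], (5, 1))
m.add_route([(0, 0), (2, 3)])
name, pos = m.retrieve([0.9, 0.1])  # goal closest to "sofa"
```

The design choice illustrated here is that landmark entries double as keys into the survey layer, so a semantic query resolves directly to an allocentric position that a planner can navigate toward.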