🤖 AI Summary
This work investigates whether embodied agents build internal cognitive maps of spatial structure rather than merely imitating expert trajectories. To this end, the authors propose LASAR, an architecture with a dual-memory system that maintains both episodic experiences and a semantic cognitive map, trained with Spatio-temporal Contextual Representation Learning (ST-CRL). ST-CRL uses cognitive queries, generated from annotated spatio-temporal context in simulation, to form sample pairs for a contrastive objective that drives the agent to form a latent cognitive map from its own experience. Experiments report 2%–3.5% gains in zero-shot generalization on the VLN-CE and VSI-Bench benchmarks, and show that the learned cognitive map has high self-consistency.
📝 Abstract
A fundamental challenge in embodied AI is verifying whether agents build internal models of spatial structure or merely learn to mimic task-specific expert trajectories. This is critical because foundational approaches rooted in action-centric tasks (e.g., VLN) and reasoning-centric tasks (e.g., EQA) often share a common limitation: they lack a learning signal that forces them to encode fine-grained spatial relationships (such as topology or distance) over long-range, fragmented experiences. To address this, we first propose LASAR, an architecture featuring a dual-memory system designed to maintain both episodic experiences and a semantic cognitive map. We then introduce Spatio-temporal Contextual Representation Learning (ST-CRL), a contrastive objective designed to train this architecture. ST-CRL builds sample pairs from the spatio-temporal cues in cognitive queries, which are generated from annotated spatio-temporal context in simulation, and thereby forms the internal cognitive map from the agent's experiences. Experiments demonstrate that our method achieves 2%–3.5% gains in zero-shot generalization on both the standard VLN-CE and VSI-Bench benchmarks. We also demonstrate that our proposed cognitive map has high self-consistency.
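To make the idea of a contrastive objective over sample pairs concrete, the following is a minimal sketch and not the paper's implementation. The abstract does not state the exact form of the ST-CRL loss, so this example assumes a standard InfoNCE loss with in-batch negatives. The function name `info_nce_loss`, the temperature value, and the random vectors standing in for query and memory embeddings are all hypothetical.

```python
import numpy as np

def info_nce_loss(queries, keys, temperature=0.1):
    """InfoNCE loss with in-batch negatives (an assumed stand-in for ST-CRL).

    queries[i] and keys[i] are treated as a positive pair; every other
    key in the batch acts as a negative for queries[i].
    """
    # L2-normalise so the dot product is cosine similarity.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ k.T / temperature
    # Subtract the row-wise max for numerical stability before the softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal of the similarity matrix.
    return float(-np.mean(np.diag(log_probs)))

# Hypothetical embeddings: 8 query vectors and their matching targets.
rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 16))
aligned = anchors + 0.01 * rng.normal(size=(8, 16))  # correct pairing
shuffled = aligned[::-1]                             # every pairing is wrong

loss_aligned = info_nce_loss(anchors, aligned)
loss_shuffled = info_nce_loss(anchors, shuffled)
print(f"aligned pairs:  {loss_aligned:.4f}")
print(f"shuffled pairs: {loss_shuffled:.4f}")
```

In this sketch the loss is low when each query embedding is closest to its own paired embedding and high when the pairing is scrambled, which is the kind of signal a contrastive objective provides. How ST-CRL actually selects positives and negatives from spatio-temporal context is described in the paper itself, not here.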