🤖 AI Summary
Existing memory-persistent vision-and-language navigation (VLN) methods face two key bottlenecks in using long-term memory: (1) rigid memory access, relying either on incorporating the entire memory or on fixed-horizon lookup, and (2) memory storage limited to environmental observations, which neglects the navigation behavioral patterns that encode decision-making strategies. To address these, we propose Memoir, an imagination-guided memory retrieval framework. A language-conditioned world model imagines future navigation states and uses them as task-aware queries to retrieve relevant experiences from a Hybrid Viewpoint-Level Memory, which anchors both environmental observations and behavioral patterns to viewpoints. The same world model both encodes experiences for storage and generates retrieval queries, and an experience-augmented navigation model integrates the retrieved knowledge through specialized encoders. Evaluated on memory-persistent VLN benchmarks covering 10 distinct testing scenarios, Memoir improves on all of them, including a 5.4% gain in Success weighted by Path Length (SPL) on IR2R over the best memory-persistent baseline, together with an 8.3× training speedup and a 74% reduction in inference memory.
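For readers unfamiliar with the metric, the sketch below computes Success weighted by Path Length (SPL) under its commonly used definition: each successful episode contributes the ratio of the shortest-path length to the longer of the agent's path and the shortest path, and failed episodes contribute zero. The function name `spl` and the episode tuple layout are illustrative assumptions, and the paper's own evaluation code may differ in details.

```python
def spl(episodes):
    """Average SPL over episodes.

    Each episode is a tuple (success, shortest_len, agent_len), a layout
    assumed here purely for illustration:
      success      -- True if the agent stopped at the goal
      shortest_len -- length of the shortest path from start to goal
      agent_len    -- length of the path the agent actually took
    """
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for success, shortest_len, agent_len in episodes:
        if success:
            # A detour lowers the score; a path equal to the shortest scores 1.0.
            total += shortest_len / max(agent_len, shortest_len)
    return total / len(episodes)


example_episodes = [
    (True, 10.0, 10.0),   # optimal path, contributes 1.0
    (True, 10.0, 20.0),   # success with a 2x detour, contributes 0.5
    (False, 10.0, 5.0),   # failure, contributes 0.0
]
example_score = spl(example_episodes)
```

Because path efficiency is folded into the score, an SPL gain can come from succeeding more often, from taking shorter routes, or from both.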
📝 Abstract
Vision-and-Language Navigation (VLN) requires agents to follow natural language instructions through environments, with memory-persistent variants demanding progressive improvement through accumulated experience. Existing approaches for memory-persistent VLN face critical limitations: they lack effective memory access mechanisms, relying instead on incorporating the entire memory or on fixed-horizon lookup, and they predominantly store only environmental observations while neglecting navigation behavioral patterns that encode valuable decision-making strategies. We present Memoir, which employs imagination as a retrieval mechanism grounded by explicit memory: a world model imagines future navigation states as queries to selectively retrieve relevant environmental observations and behavioral histories. The approach comprises: 1) a language-conditioned world model that imagines future states, serving the dual purposes of encoding experiences for storage and generating retrieval queries; 2) Hybrid Viewpoint-Level Memory that anchors both observations and behavioral patterns to viewpoints, enabling hybrid retrieval; and 3) an experience-augmented navigation model that integrates retrieved knowledge through specialized encoders. Extensive evaluation across diverse memory-persistent VLN benchmarks with 10 distinct testing scenarios demonstrates Memoir's effectiveness: significant improvements across all scenarios, with a 5.4% SPL gain on IR2R over the best memory-persistent baseline, accompanied by an 8.3× training speedup and a 74% reduction in inference memory. The results validate that predictive retrieval of both environmental and behavioral memories enables more effective navigation, with analysis indicating substantial headroom (73.3% vs 93.4% upper bound) for this imagination-guided paradigm. Code is available at https://github.com/xyz9911/Memoir.
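The retrieval idea described above can be illustrated with a toy sketch: imagined future states act as queries, and memory entries anchored to viewpoints are ranked by similarity to those queries. Everything here is an assumption made for illustration and not the Memoir implementation: the names `ViewpointEntry`, `imagine`, and `retrieve`, the use of cosine similarity with top-k selection, and the `imagine` function, which merely interpolates toward an instruction vector as a stand-in for a learned language-conditioned world model.

```python
import math
from dataclasses import dataclass


@dataclass
class ViewpointEntry:
    """One memory slot anchored to a viewpoint (hypothetical layout)."""
    viewpoint_id: str
    state_key: list      # encoded state stored when the viewpoint was visited
    observation: str     # stand-in for stored observation features
    behavior: list       # stand-in for the stored behavioral history


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def imagine(state, instruction, horizon=2):
    """Toy stand-in for a world model: roll the state toward the instruction."""
    imagined, current = [], state
    for _ in range(horizon):
        current = [0.5 * s + 0.5 * g for s, g in zip(current, instruction)]
        imagined.append(current)
    return imagined


def retrieve(memory, queries, top_k=2):
    """Rank entries by their best similarity to any imagined query."""
    scored = [(max(cosine(q, e.state_key) for q in queries), e) for e in memory]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:top_k]]


memory = [
    ViewpointEntry("vp_kitchen", [1.0, 0.0], "kitchen counter", ["forward", "left"]),
    ViewpointEntry("vp_hall", [0.0, 1.0], "hallway", ["forward"]),
    ViewpointEntry("vp_bedroom", [-1.0, 0.0], "bedroom door", ["right"]),
]
queries = imagine(state=[0.0, 1.0], instruction=[1.0, 0.0])
retrieved = retrieve(memory, queries, top_k=2)
retrieved_ids = [entry.viewpoint_id for entry in retrieved]
```

In this toy setup the agent starts near the hallway and the instruction points toward the kitchen, so the imagined states pull the kitchen entry to the top of the ranking while the bedroom entry is left out. Each retrieved entry carries both an observation and a behavioral history, mirroring the hybrid retrieval the abstract describes.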