Dream to Recall: Imagination-Guided Experience Retrieval for Memory-Persistent Vision-and-Language Navigation

📅 2025-10-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-and-language navigation (VLN) methods face two bottlenecks in using long-term memory: (1) rigid memory access, relying either on incorporating the entire memory or on fixed-horizon lookup, and (2) memory that stores only environmental observations and neglects navigation behavioral patterns. To address these, we propose Memoir, an imagination-guided memory retrieval framework. A language-conditioned world model imagines future navigation states, which serve as task-aware queries for retrieving relevant experiences from a Hybrid Viewpoint-Level Memory that anchors both environmental observations and behavioral patterns to viewpoints. The same world model both encodes experiences for storage and generates retrieval queries, and an experience-augmented navigation model integrates the retrieved knowledge through specialized encoders. Evaluated on memory-persistent VLN benchmarks covering 10 testing scenarios, our method improves Success weighted by Path Length (SPL) by 5.4% on IR2R over the best memory-persistent baseline, with an 8.3x training speedup and a 74% reduction in inference memory.
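For readers unfamiliar with the headline metric, the following is a minimal sketch of how Success weighted by Path Length is commonly computed in the navigation literature: each successful episode contributes the ratio of the shortest-path length to the longer of the shortest and actually travelled path, and failures contribute zero. The function name `spl` and its argument names are illustrative assumptions, not taken from the Memoir codebase, and the paper's own evaluation script may differ in details.

```python
from typing import Sequence


def spl(successes: Sequence[bool],
        shortest_lengths: Sequence[float],
        taken_lengths: Sequence[float]) -> float:
    """Success weighted by Path Length, averaged over episodes.

    Standard definition: mean over episodes of S_i * l_i / max(p_i, l_i),
    where S_i is the success flag, l_i the shortest-path length, and
    p_i the length of the path the agent actually took.
    """
    n = len(successes)
    if n == 0 or n != len(shortest_lengths) or n != len(taken_lengths):
        raise ValueError("inputs must be non-empty and of equal length")

    total = 0.0
    for success, shortest, taken in zip(successes, shortest_lengths, taken_lengths):
        if not success:
            continue  # failed episodes contribute 0
        longest = max(taken, shortest)
        # Guard against a degenerate zero-length episode (assumption: count it as 1).
        total += shortest / longest if longest > 0 else 1.0
    return total / n


if __name__ == "__main__":
    # Three episodes: optimal success, success with a 2x detour, failure.
    print(spl([True, True, False], [10.0, 10.0, 10.0], [10.0, 20.0, 10.0]))
```

Because the ratio is capped at 1 and failures score 0, SPL rewards agents that both reach the goal and do so along near-shortest paths.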

📝 Abstract

Vision-and-Language Navigation (VLN) requires agents to follow natural language instructions through environments, with memory-persistent variants demanding progressive improvement through accumulated experience. Existing approaches for memory-persistent VLN face critical limitations: they lack effective memory access mechanisms, instead relying on incorporating the entire memory or on fixed-horizon lookup, and predominantly store only environmental observations while neglecting navigation behavioral patterns that encode valuable decision-making strategies. We present Memoir, which employs imagination as a retrieval mechanism grounded by explicit memory: a world model imagines future navigation states as queries to selectively retrieve relevant environmental observations and behavioral histories. The approach comprises: 1) a language-conditioned world model that imagines future states serving dual purposes: encoding experiences for storage and generating retrieval queries; 2) Hybrid Viewpoint-Level Memory that anchors both observations and behavioral patterns to viewpoints, enabling hybrid retrieval; and 3) an experience-augmented navigation model that integrates retrieved knowledge through specialized encoders. Extensive evaluation across diverse memory-persistent VLN benchmarks with 10 distinct testing scenarios demonstrates Memoir's effectiveness: significant improvements across all scenarios, with 5.4% SPL gains on IR2R over the best memory-persistent baseline, accompanied by 8.3x training speedup and 74% inference memory reduction. The results validate that predictive retrieval of both environmental and behavioral memories enables more effective navigation, with analysis indicating substantial headroom (73.3% vs 93.4% upper bound) for this imagination-guided paradigm. Code at https://github.com/xyz9911/Memoir.
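The retrieval idea in the abstract can be illustrated with a toy sketch: a query vector standing in for an imagined future state is compared against keys stored per viewpoint, and the best-matching viewpoints return both their observation and behavior features. Everything here is an assumption made for illustration: the class `HybridViewpointMemory`, the `ViewpointEntry` fields, the use of cosine similarity, and top-k selection are hypothetical stand-ins and are not taken from the Memoir repository. In the paper the query would come from the language-conditioned world model, which this sketch does not implement.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


@dataclass
class ViewpointEntry:
    key: list[float]          # encoded state used for matching (hypothetical)
    observation: list[float]  # stored environmental observation feature
    behavior: list[float]     # stored behavioral (navigation history) feature


class HybridViewpointMemory:
    """Toy memory that anchors observation and behavior features to viewpoint ids."""

    def __init__(self) -> None:
        self.entries: dict[str, ViewpointEntry] = {}

    def write(self, viewpoint_id: str, entry: ViewpointEntry) -> None:
        self.entries[viewpoint_id] = entry

    def retrieve(self, query: list[float], top_k: int = 2) -> list[tuple[str, ViewpointEntry]]:
        """Return the top_k viewpoints whose keys are most similar to the query."""
        ranked = sorted(
            self.entries.items(),
            key=lambda item: cosine(query, item[1].key),
            reverse=True,
        )
        return ranked[:top_k]


if __name__ == "__main__":
    memory = HybridViewpointMemory()
    memory.write("vp_a", ViewpointEntry([1.0, 0.0], [0.1], [0.9]))
    memory.write("vp_b", ViewpointEntry([0.0, 1.0], [0.2], [0.8]))
    memory.write("vp_c", ViewpointEntry([0.7, 0.7], [0.3], [0.7]))

    imagined_state = [1.0, 0.1]  # placeholder for a world-model prediction
    for viewpoint_id, entry in memory.retrieve(imagined_state, top_k=2):
        print(viewpoint_id, entry.observation, entry.behavior)
```

The point of the sketch is the contrast with the two baselines the abstract criticizes: only the entries relevant to the predicted state are read, rather than the whole memory or a fixed recent window.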

Problem

Research questions and friction points this paper is trying to address.

Enhancing memory access mechanisms for persistent navigation agents

Integrating environmental observations with behavioral decision patterns

Improving navigation efficiency through predictive imagination-guided retrieval

Innovation

Methods, ideas, or system contributions that make the work stand out.

Imagination-guided retrieval of environmental observations and behavioral histories

Hybrid memory storing both observations and behavioral patterns

World model generates queries for selective memory access

🔎 Similar Papers

Vision-and-Language Navigation Today and Tomorrow: A Survey in the Era of Foundation Models