CMMR-VLN: Vision-and-Language Navigation via Continual Multimodal Memory Retrieval

📅 2026-03-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing vision-and-language navigation methods struggle to effectively leverage prior experience in long-horizon or unfamiliar environments. This work proposes the first navigation framework that integrates structured multimodal memory, retrieval-augmented generation, and reflective memory updating. The approach constructs a searchable experience repository from panoramic images and salient landmarks, employs a large language model for experience-driven decision-making, and dynamically updates its memory during navigation. Evaluated in both simulated and real-world settings, the method substantially outperforms state-of-the-art baselines—NavGPT, MapGPT, and DiscussNav—achieving average success rate improvements of 52.9%, 20.9%, and 20.9% in simulation, and 200%, 50%, and 50% in real-world environments, respectively.

📝 Abstract
Although large language models (LLMs) have been introduced into vision-and-language navigation (VLN) to improve instruction comprehension and generalization, existing LLM-based VLN methods lack the ability to selectively recall and use relevant prior experience, limiting their performance in long-horizon and unfamiliar scenarios. In this work, we propose CMMR-VLN (Continual Multimodal Memory Retrieval based VLN), a VLN framework that endows LLM agents with structured memory and reflection capabilities. Specifically, CMMR-VLN constructs a multimodal experience memory indexed by panoramic visual images and salient landmarks to retrieve relevant experiences during navigation, introduces a retrieval-augmented generation pipeline to mimic how experienced human navigators leverage prior knowledge, and incorporates a reflection-based memory update strategy that selectively stores complete successful paths and the key initial mistake in failure cases. Comprehensive tests show average success rate improvements of 52.9%, 20.9%, and 20.9% over NavGPT, MapGPT, and DiscussNav in simulation, and of 200%, 50%, and 50% in real-world tests, respectively, elucidating the potential of CMMR-VLN as a backbone VLN framework.
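The abstract describes three components: an experience memory indexed by salient landmarks, retrieval of relevant prior episodes during navigation, and a reflection step that stores complete successful paths but only the key initial mistake from failures. The following is a minimal sketch of that memory/retrieval/reflection loop; all names (`Experience`, `MultimodalMemory`, landmark-overlap scoring) are illustrative assumptions, not the paper's implementation, which indexes on panoramic images and uses an LLM for decision-making.

```python
from dataclasses import dataclass


@dataclass
class Experience:
    landmarks: frozenset  # salient landmarks seen along the episode
    actions: list         # action sequence taken
    success: bool         # whether the episode reached the goal


class MultimodalMemory:
    """Hypothetical sketch: episodes indexed by landmarks, retrieved by
    overlap with the current observation, and updated via reflection."""

    def __init__(self):
        self.store = []

    def retrieve(self, observed_landmarks, k=2):
        # Rank stored experiences by landmark overlap with the current view.
        scored = sorted(
            self.store,
            key=lambda e: len(e.landmarks & observed_landmarks),
            reverse=True,
        )
        return scored[:k]

    def reflect_and_update(self, landmarks, actions, success, first_error_idx=None):
        if success:
            # Store the complete successful trajectory.
            self.store.append(Experience(frozenset(landmarks), list(actions), True))
        elif first_error_idx is not None:
            # Store only the prefix up to and including the first mistake,
            # mirroring the "key initial mistake" strategy.
            self.store.append(
                Experience(frozenset(landmarks), actions[: first_error_idx + 1], False)
            )


# Toy usage: one success, one failure, then a retrieval.
mem = MultimodalMemory()
mem.reflect_and_update({"sofa", "tv"}, ["forward", "left", "stop"], True)
mem.reflect_and_update({"sink", "mirror"}, ["forward", "right", "forward"],
                       False, first_error_idx=1)
hits = mem.retrieve({"tv", "lamp"}, k=1)
```

Retrieved experiences would then be injected into the LLM's prompt as the retrieval-augmented context that guides the next navigation action.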
Problem

Research questions and friction points this paper is trying to address.

vision-and-language navigation
continual memory retrieval
multimodal memory
prior experience utilization
long-horizon navigation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Continual Multimodal Memory Retrieval
Vision-and-Language Navigation
Retrieval-Augmented Generation
Memory Reflection
LLM-based Navigation
Haozhou Li
The College of Software at Northeastern University, Shenyang, China
Xiangyu Dong
Staff Software Engineer, Google
Computer architecture
Huiyan Jiang
The College of Software at Northeastern University, Shenyang, China
Yaoming Zhou
The School of Aeronautic Science and Engineering at Beihang University, Beijing, China
Xiaoguang Ma
The Foshan Graduate School of Innovation at Northeastern University, Foshan, China; The Faculty of Robot Science and Engineering at Northeastern University, Shenyang, China