HiMemVLN: Enhancing Reliability of Open-Source Zero-Shot Vision-and-Language Navigation with Hierarchical Memory System

📅 2026-03-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies and formally defines “navigation amnesia”, a pervasive issue in open-source large language models (LLMs) for zero-shot vision-and-language navigation (VLN): the models struggle to maintain long-term spatial memory, which substantially degrades performance relative to closed-source counterparts. To address this limitation, the authors propose HiMemVLN, which incorporates a hierarchical memory system into a multimodal large model to improve recall of visual observations and sustain consistent long-term localization. Experiments in both simulated and real-world environments show that the approach achieves nearly twice the performance of the current open-source state-of-the-art method.

📝 Abstract
LLM-based agents have demonstrated impressive zero-shot performance in vision-and-language navigation (VLN) tasks. However, most zero-shot methods rely on closed-source LLMs as navigators, which incur high token costs and pose potential data leakage risks. Recent efforts have attempted to address this by combining open-source LLMs with a spatiotemporal chain-of-thought (CoT) framework, but they still fall far short of closed-source models. In this work, we identify a critical issue, Navigation Amnesia, through a detailed analysis of the navigation process. This issue leads to navigation failures and widens the gap between open-source and closed-source methods. To address this, we propose HiMemVLN, which incorporates a Hierarchical Memory System into a multimodal large model to enhance visual perception recall and long-term localization, mitigating the amnesia issue and improving the agent's navigation performance. Extensive experiments in both simulated and real-world environments demonstrate that HiMemVLN achieves nearly twice the performance of the open-source state-of-the-art method. The code is available at https://github.com/lvkailin0118/HiMemVLN.
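The abstract does not describe how the Hierarchical Memory System is built, so the following is only a generic illustration of the idea of a two-level memory, not the authors' implementation. It assumes a short-term level that keeps the most recent steps verbatim and a long-term level that keeps older steps as compact summaries. All names here (Observation, HierarchicalMemory, short_term_size, to_prompt) are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Observation:
    """One navigation step (hypothetical structure, for illustration only)."""
    step: int
    description: str  # e.g. a text caption of the current view
    action: str       # the action taken at this step


@dataclass
class HierarchicalMemory:
    """Two-level memory: recent steps verbatim, older steps as short summaries."""
    short_term_size: int = 3
    short_term: deque = field(default_factory=deque)
    long_term: list = field(default_factory=list)

    def add(self, obs: Observation) -> None:
        self.short_term.append(obs)
        # When the recent window overflows, demote the oldest entry to a summary.
        if len(self.short_term) > self.short_term_size:
            old = self.short_term.popleft()
            self.long_term.append(f"step {old.step}: {old.action}")

    def to_prompt(self) -> str:
        """Render both levels as text that could be given to a navigator model."""
        lines = ["Route so far:"] + self.long_term
        lines.append("Recent observations:")
        lines += [
            f"step {o.step}: saw {o.description}; did {o.action}"
            for o in self.short_term
        ]
        return "\n".join(lines)


if __name__ == "__main__":
    memory = HierarchicalMemory(short_term_size=3)
    for i in range(5):
        memory.add(Observation(step=i, description=f"view {i}", action="forward"))
    print(memory.to_prompt())
```

The point of the sketch is the trade-off such a design targets: the context passed to the model stays bounded, while a trace of the whole route remains available for localization. The repository linked above is the place to check how HiMemVLN actually does this.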
Problem

Research questions and friction points this paper is trying to address.

Vision-and-Language Navigation
Zero-Shot
Open-Source LLMs
Navigation Amnesia
Multimodal Navigation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Memory System
Vision-and-Language Navigation
Zero-Shot Learning
Navigation Amnesia
Open-Source LLMs
Kailin Lyu
Institute of Automation, Chinese Academy of Sciences, and the Zhongguancun Academy, Beijing, China
Kangyi Wu
Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Xi’an, China
Pengna Li
Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Xi’an, China
Xiuyu Hu
School of Transportation, Tongji University, Shanghai, China
Qingyi Si
JD.com, Beijing, China
Cui Miao
Institute of National University of Defense Technology, Changsha, China
Ning Yang
School of Intelligent Science and Technology, Nanjing University, Suzhou, China, and the Institute of Automation, Chinese Academy of Sciences, Beijing, China
Zihang Wang
Southeast University, Nanjing, China
Long Xiao
University of Cambridge, Engineering Department, Cavendish Laboratory
Graphene, Photonics, Terahertz, Communication System
Lianyu Hu
Nanyang Technological University, Singapore
Jingyuan Sun
Assistant Professor, The University of Manchester
neural encoding and decoding, brain machine interface, large language models
Ce Hao
National University of Singapore