HiMemVLN: Enhancing Reliability of Open-Source Zero-Shot Vision-and-Language Navigation with Hierarchical Memory System

📅 2026-03-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work identifies “navigation amnesia”, a failure mode of open-source large language models (LLMs) in zero-shot vision-and-language navigation (VLN): the agent struggles to maintain long-term spatial memory, which causes navigation failures and widens the performance gap with closed-source counterparts. To address this, the authors propose HiMemVLN, which incorporates a hierarchical memory system into a multimodal large model to improve recall of visual observations and sustain consistent long-term localization. Experiments in both simulated and real-world environments show that HiMemVLN achieves nearly twice the performance of the current state-of-the-art open-source method.

📝 Abstract

LLM-based agents have demonstrated impressive zero-shot performance in vision-and-language navigation (VLN) tasks. However, most zero-shot methods rely on closed-source LLMs as navigators, which incur high token costs and pose potential data leakage risks. Recent efforts have attempted to address this by combining open-source LLMs with a spatiotemporal CoT framework, but they still fall far short of closed-source models. In this work, we identify a critical issue, Navigation Amnesia, through a detailed analysis of the navigation process. This issue leads to navigation failures and widens the gap between open-source and closed-source methods. To address this, we propose HiMemVLN, which incorporates a Hierarchical Memory System into a multimodal large model to enhance visual perception recall and long-term localization, mitigating the amnesia issue and improving the agent's navigation performance. Extensive experiments in both simulated and real-world environments demonstrate that HiMemVLN achieves nearly twice the performance of the open-source state-of-the-art method. The code is available at https://github.com/lvkailin0118/HiMemVLN.
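
To make the idea of a hierarchical memory concrete, the following is a minimal illustrative sketch and not the HiMemVLN implementation, whose design the abstract does not specify. It assumes a two-level layout: a small window of recent observations kept in full, and a compact per-location visit log kept for the whole episode. The names `Observation`, `HierarchicalMemory`, `recall`, and `to_prompt` are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Observation:
    # Hypothetical record of one navigation step.
    step: int
    location: str   # assumed place label, e.g. "hall"
    caption: str    # assumed text description of the current view


class HierarchicalMemory:
    """Illustrative two-level memory (an assumption, not the paper's design)."""

    def __init__(self, window: int = 3):
        # Short-term level: only the most recent `window` observations, in full.
        self.recent = deque(maxlen=window)
        # Long-term level: location -> list of steps at which it was visited.
        self.visited = {}

    def add(self, obs: Observation) -> None:
        self.recent.append(obs)  # oldest entry is dropped automatically
        self.visited.setdefault(obs.location, []).append(obs.step)

    def recall(self, location: str) -> list:
        # Steps at which `location` was seen, even if it left the short-term window.
        return list(self.visited.get(location, []))

    def to_prompt(self) -> str:
        # Render both levels as text that could be placed in a navigator prompt.
        lines = ["Visited so far:"]
        for loc, steps in self.visited.items():
            lines.append(f"- {loc} (steps {', '.join(map(str, steps))})")
        lines.append("Recent views:")
        for obs in self.recent:
            lines.append(f"- step {obs.step} at {obs.location}: {obs.caption}")
        return "\n".join(lines)


mem = HierarchicalMemory(window=2)
for i, loc in enumerate(["hall", "kitchen", "hall", "stairs"]):
    mem.add(Observation(step=i, location=loc, caption=f"view of {loc}"))
```

In this sketch, the long-term level still records that "hall" was visited at steps 0 and 2 after the step-0 observation has left the short-term window, which is the kind of long-horizon recall that the abstract says navigation amnesia undermines.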

Problem

Research questions and friction points this paper is trying to address.

Vision-and-Language Navigation

Zero-Shot

Open-Source LLMs

Navigation Amnesia

Multimodal Navigation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Memory System

Vision-and-Language Navigation

Zero-Shot Learning