SE-VLN: A Self-Evolving Vision-Language Navigation Framework Based on Multimodal Large Language Models

📅 2025-07-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current vision-and-language navigation (VLN) methods rely on large language models (LLMs) with static knowledge, hindering experience accumulation and utilization and thus limiting generalization and evolutionary capability. This paper proposes the first self-evolving multimodal LLM framework tailored for VLN, built on three key components: hierarchical memory, retrieval-augmented reasoning, and automated reflection. Together these enable experience-driven continual learning and multi-step decision optimization at test time. Crucially, the agent evolves *in situ* while executing navigation tasks in unseen environments, significantly enhancing long-horizon robustness. Evaluated on the R2R and REVERIE benchmarks, the method achieves success rates of 57.0% and 35.2%, respectively, absolute improvements of 23.9% and 15.0% over the prior state of the art. Moreover, performance consistently improves as interaction experience accumulates, demonstrating genuine online adaptation.

📝 Abstract
Recent advances in vision-language navigation (VLN) are mainly attributed to emerging large language models (LLMs). These methods exhibit excellent generalization in instruction understanding and task reasoning. However, they are constrained by the fixed knowledge bases and reasoning abilities of LLMs, which prevent them from fully incorporating experiential knowledge and thus leave them without an efficient evolutionary capacity. To address this, we draw inspiration from the evolutionary capabilities of natural agents and propose a self-evolving VLN framework (SE-VLN) that endows VLN agents with the ability to evolve continuously during testing. To the best of our knowledge, this is the first multimodal-LLM-powered self-evolving VLN framework. Specifically, SE-VLN comprises three core modules: a hierarchical memory module that transfers successful and failed cases into reusable knowledge, a retrieval-augmented thought-based reasoning module that retrieves experience and enables multi-step decision-making, and a reflection module that realizes continual evolution. Comprehensive tests show that SE-VLN achieves navigation success rates of 57.0% and 35.2% in unseen environments, absolute improvements of 23.9% and 15.0% over current state-of-the-art methods on the R2R and REVERIE datasets, respectively. Moreover, SE-VLN's performance improves as its experience repository grows, underscoring its potential as a self-evolving agent framework for VLN.
Problem

Research questions and friction points this paper is trying to address.

Enabling VLN agents to evolve continuously during testing
Overcoming the fixed-knowledge limitation of current LLM-based VLN methods
Improving navigation success rates in unseen environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-evolving VLN framework with multimodal LLMs
Hierarchical memory for reusing success and failure cases
Retrieval-augmented reasoning for multi-step decision-making
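The evolve-while-navigating loop these modules describe (store distilled experiences, retrieve relevant ones before deciding, reflect after each episode) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: all class and function names are hypothetical, retrieval is a naive keyword-overlap stand-in for the embedding- or MLLM-based retrieval the paper would use, and `reflect` fabricates a one-line lesson where SE-VLN would query a multimodal LLM.

```python
# Hypothetical sketch of a self-evolving navigation loop; names are
# illustrative and do NOT correspond to SE-VLN's actual API.
from dataclasses import dataclass, field

@dataclass
class Experience:
    instruction: str  # natural-language navigation instruction
    outcome: str      # "success" or "failure"
    lesson: str       # distilled, reusable knowledge

@dataclass
class HierarchicalMemory:
    """Stores distilled experiences. Retrieval here is naive keyword
    overlap, a stand-in for learned similarity search."""
    experiences: list = field(default_factory=list)

    def add(self, exp: Experience) -> None:
        self.experiences.append(exp)

    def retrieve(self, instruction: str, k: int = 2) -> list:
        words = set(instruction.lower().split())
        # Rank stored experiences by word overlap with the new instruction.
        scored = sorted(
            self.experiences,
            key=lambda e: -len(words & set(e.instruction.lower().split())),
        )
        return scored[:k]

def reflect(instruction: str, success: bool) -> Experience:
    # SE-VLN's reflection module would query the MLLM; we fabricate a
    # one-line lesson so the loop is runnable end to end.
    lesson = ("reuse this route strategy" if success
              else "avoid the failed action sequence")
    return Experience(instruction, "success" if success else "failure", lesson)

# One episode: reflect on an outcome, write the lesson back to memory,
# then retrieve it to guide a later, similar instruction.
memory = HierarchicalMemory()
memory.add(reflect("walk past the kitchen to the bedroom", success=True))
hints = memory.retrieve("go through the kitchen into the bedroom")
print([h.lesson for h in hints])
```

Because every episode ends with a write-back to memory, later retrievals see a strictly growing experience repository, which is the mechanism behind the reported improvement with accumulated experience.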
Xiangyu Dong
Staff Software Engineer, Google
Computer architecture
Haoran Zhao
School of Aeronautic Science and Engineering, Beihang University, Beijing, China
Jiang Gao
Foshan Graduate School of Innovation, Northeastern University, Foshan, China; Faculty of Robot Science and Engineering, Northeastern University, Shenyang, China
Haozhou Li
Foshan Graduate School of Innovation, Northeastern University, Foshan, China
Xiaoguang Ma
Foshan Graduate School of Innovation, Northeastern University, Foshan, China; Faculty of Robot Science and Engineering, Northeastern University, Shenyang, China
Yaoming Zhou
School of Aeronautic Science and Engineering, Beihang University, Beijing, China
Fuhai Chen
School of Computer Science and Big Data, Fuzhou University, Fuzhou, China
Juan Liu
Wuhan University
Data Mining · Artificial Intelligence in Bioinformatics · Biomedicine