EvolveNav: Self-Improving Embodied Reasoning for LLM-Based Vision-Language Navigation

📅 2025-06-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM-based vision-language navigation (VLN) agents rely on direct input-output mapping, which makes their reasoning opaque, limits generalization, and leaves them dependent on scarce, high-quality chain-of-thought (CoT) annotations. Method: a two-stage self-evolving reasoning framework: (1) formalized CoT supervised fine-tuning to activate the model's navigational reasoning capabilities; and (2) self-reflective training on diverse, model-generated CoTs, augmented by a contrastive auxiliary task for detecting erroneous reasoning. The method combines CoT diversity enhancement, self-supervised optimization, and joint multimodal fine-tuning. Contribution/Results: the approach significantly improves navigation success rate and path fidelity on mainstream VLN benchmarks while enhancing decision interpretability and cross-environment generalization.
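A minimal sketch of the stage-2 "self-enriched" relabeling idea described above, in toy Python. All names here (`self_enrich`, `stub_model`, the action vocabulary) are my own illustration, not the paper's code: the model samples several candidate reasoning chains per instruction, and only chains whose final action agrees with the expert action are kept as new CoT supervision.

```python
import random

def generate_cots(model, instruction, n=4):
    """Sample n candidate (reasoning, action) pairs from the model."""
    return [model(instruction) for _ in range(n)]

def self_enrich(model, dataset, n=4):
    """Keep model-generated CoTs whose predicted action matches the expert's."""
    enriched = []
    for instruction, expert_action in dataset:
        for cot, action in generate_cots(model, instruction, n):
            if action == expert_action:  # correctness filter on the final decision
                enriched.append((instruction, cot, expert_action))
    return enriched

# Stub stand-in for an LLM navigator: returns a (reasoning, action) pair.
def stub_model(instruction):
    action = random.choice(["forward", "left", "right"])
    return (f"Given '{instruction}', I pick an action.", action)

random.seed(0)
data = [("go to the kitchen", "forward"), ("turn left at the sofa", "left")]
labels = self_enrich(stub_model, data, n=8)
print(len(labels))
```

In the actual framework the filtered chains would be fed back as fine-tuning targets in the next training iteration, diversifying supervision beyond the fixed formalized CoT labels.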

📝 Abstract
Building Vision-Language Navigation (VLN) agents that can navigate by following natural language instructions is a long-standing goal in human-robot interaction applications. Recent studies have revealed the potential of training open-source Large Language Models (LLMs) to unleash their reasoning ability for improving navigation, while simultaneously mitigating the domain gap between LLMs' training corpora and the VLN task. However, these approaches primarily adopt direct input-output mapping paradigms, which make the mapping difficult to learn and the navigational decisions unexplainable. Chain-of-Thought (CoT) training is a promising way to improve both navigational decision accuracy and interpretability, but the complexity of the navigation task makes perfect CoT labels unavailable, and pure CoT supervised fine-tuning may lead to overfitting. In this paper, we propose a novel sElf-improving embodied reasoning framework for boosting LLM-based vision-language Navigation, dubbed EvolveNav. EvolveNav consists of two stages: (1) Formalized CoT Supervised Fine-Tuning, where we train the model with formalized CoT labels to both activate the model's navigational reasoning capabilities and increase reasoning speed; and (2) Self-Reflective Post-Training, where the model is iteratively trained with its own reasoning outputs as self-enriched CoT labels to enhance supervision diversity. A self-reflective auxiliary task is also introduced to encourage learning correct reasoning patterns by contrasting them with wrong ones. Experimental results on popular VLN benchmarks demonstrate the superiority of EvolveNav over previous LLM-based VLN approaches. Code is available at https://github.com/expectorlin/EvolveNav.
Problem

Research questions and friction points this paper is trying to address.

Improving navigation accuracy via self-enriched Chain-of-Thought reasoning
Mitigating the domain gap between LLMs' training corpora and the VLN task
Enhancing interpretability of navigational decisions in embodied agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

Formalized CoT Supervised Fine-Tuning to activate reasoning and increase its speed
Self-Reflective Post-Training with self-enriched CoT labels
Self-reflective auxiliary task contrasting correct with wrong reasoning patterns
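The contrastive auxiliary task above can be illustrated with a toy objective. This is my own InfoNCE-style illustration under assumed scoring semantics, not the paper's exact loss: given a score for the correct reasoning chain and scores for wrong ones, minimize the negative log-probability of the correct chain.

```python
import math

def contrastive_loss(correct_score, wrong_scores, temperature=1.0):
    """-log softmax probability assigned to the correct reasoning chain."""
    logits = [correct_score] + list(wrong_scores)
    exps = [math.exp(z / temperature) for z in logits]
    return -math.log(exps[0] / sum(exps))

# Ranking the correct chain above the wrong ones yields a lower loss.
low = contrastive_loss(3.0, [0.5, 0.2])
high = contrastive_loss(0.2, [3.0, 2.5])
print(low < high)  # → True
```

Minimizing such a loss pushes the model to score correct reasoning patterns above erroneous ones, which is the intuition behind the auxiliary task.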