VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning

📅 2025-06-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language navigation (VLN) methods are constrained by discrete topological graphs, hindering end-to-end generation of continuous actions from egocentric video streams; while large language models (LLMs) excel at high-level instruction understanding, they lack fine-grained action control. This paper proposes the first end-to-end continuous VLN framework based on large vision-language models (LVLMs): it eliminates explicit topological modeling and directly regresses continuous navigation actions from sequences of egocentric video frames. The authors introduce a GRPO-inspired reinforcement fine-tuning paradigm featuring a Time-Decayed Reward (TDR) and Long-Short Memory Sampling to enable language-guided, action-level optimization. Trained on the newly constructed VLN-Ego dataset and evaluated in the Habitat simulator, the method achieves state-of-the-art performance on the VLN-CE benchmark, with significant improvements in data efficiency and task-specific reasoning capability.

📝 Abstract
Vision-Language Navigation (VLN) is a core challenge in embodied AI, requiring agents to navigate real-world environments using natural language instructions. Current language model-based navigation systems operate on discrete topological graphs, limiting path planning to predefined node connections. We propose VLN-R1, an end-to-end framework that leverages Large Vision-Language Models (LVLMs) to directly translate egocentric video streams into continuous navigation actions, adopting GRPO-based training inspired by DeepSeek-R1. To enable effective training, we first construct the VLN-Ego dataset using a 3D simulator, Habitat, and propose Long-Short Memory Sampling to balance historical and current observations. While large language models can supervise complete textual instructions, they lack fine-grained action-level control. Our framework employs a two-stage training approach: a) Supervised fine-tuning (SFT) to align the model's action sequence text predictions with expert demonstrations, followed by b) Reinforcement fine-tuning (RFT) enhanced with a Time-Decayed Reward (TDR) mechanism that strategically weights multi-step future actions. Experimental results show VLN-R1 achieves strong performance on the VLN-CE benchmark. VLN-R1 proves LVLMs can drive embodied navigation and enhance task-specific reasoning through data-efficient, reward-driven post-training.
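The Long-Short Memory Sampling described above can be pictured as keeping a dense window of recent frames plus a sparse subsample of older history, so the LVLM's context stays bounded while both current and long-range observations remain represented. The paper does not publish this routine here, so the following is a minimal hypothetical sketch; the function name, parameters, and stride policy are all assumptions for illustration.

```python
def long_short_memory_sample(frames, n_short=8, n_long=4):
    """Hypothetical sketch of long-short memory sampling.

    Keeps the n_short most recent frames densely ("short" memory) and a
    strided subsample of up to n_long earlier frames ("long" memory).
    Returned frames stay in temporal order.
    """
    short = frames[-n_short:]           # dense recent observations
    history = frames[:-n_short]         # everything before the short window
    if history and n_long > 0:
        stride = max(1, len(history) // n_long)
        long = history[::stride][:n_long]  # sparse long-term observations
    else:
        long = []
    return long + short


# Usage: a 20-frame episode is reduced to 4 sparse + 8 dense frames.
frames = list(range(20))
sampled = long_short_memory_sample(frames)
```

The key design point the abstract hints at is the balance: recent frames carry the fine-grained state needed for the next action, while the sparse history preserves instruction-relevant landmarks seen earlier in the episode.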
Problem

Research questions and friction points this paper is trying to address.

Enable agents to navigate real-world environments from natural language instructions.
Overcome the limitations of discrete topological graph-based navigation systems.
Provide fine-grained, action-level control in vision-language navigation.
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end framework translates video into actions.
Two-stage training aligns actions with expert demos.
Time-Decayed Reward enhances reinforcement fine-tuning.