ETP-R1: Evolving Topological Planning with Reinforcement Fine-tuning for Vision-Language Navigation in Continuous Environments

📅 2025-12-23
🤖 AI Summary
To bridge the performance gap between graph-structured methods and large vision-language models (LVLMs) in Vision-and-Language Navigation in Continuous Environments (VLN-CE), this paper proposes the first end-to-end topological-graph-based VLN planning framework. Methodologically, it introduces three key innovations: (1) the first application of closed-loop, online Reinforcement Fine-Tuning (RFT) to graph-structured VLN models; (2) construction of a large-scale, high-quality topological trajectory–instruction dataset, leveraging the Gemini API to generate low-hallucination natural-language instructions; and (3) a three-stage training paradigm that begins with R2R/RxR multi-task joint pretraining and culminates in closed-loop online RFT driven by Group Relative Policy Optimization (GRPO). Evaluated on the R2R-CE and RxR-CE benchmarks, the framework establishes new state-of-the-art results, with significant improvements in Success Rate (SR) and Success weighted by Path Length (SPL).

📝 Abstract
Vision-Language Navigation in Continuous Environments (VLN-CE) requires an embodied agent to navigate towards a target in continuous environments, following natural language instructions. While current graph-based methods offer an efficient, structured approach by abstracting the environment into a topological map and simplifying the action space to waypoint selection, they lag behind methods based on Large Vision-Language Models (LVLMs) in leveraging large-scale data and advanced training paradigms. In this paper, we aim to bridge this gap by introducing ETP-R1, a framework that applies the paradigm of scaling up data and Reinforcement Fine-Tuning (RFT) to a graph-based VLN-CE model. To build a strong foundation, we first construct a high-quality, large-scale pretraining dataset using the Gemini API. This dataset consists of diverse, low-hallucination instructions for topological trajectories, providing rich supervision for our graph-based policy to map language to topological paths. This foundation is further strengthened by unifying data from both the R2R and RxR tasks for joint pretraining. Building on this, we introduce a three-stage training paradigm, which culminates in the first application of closed-loop, online RFT to a graph-based VLN-CE model, powered by the Group Relative Policy Optimization (GRPO) algorithm. Extensive experiments demonstrate that our approach is highly effective, establishing new state-of-the-art performance across all major metrics on both the R2R-CE and RxR-CE benchmarks. Our code is available at https://github.com/Cepillar/ETP-R1.
Problem

Research questions and friction points this paper is trying to address.

Bridges the performance gap between graph-based and LVLM-based methods for VLN-CE.
Enables graph-based VLN-CE models to leverage large-scale data and advanced reinforcement fine-tuning.
Improves agent navigation in continuous environments guided by natural language instructions.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constructs a large-scale, low-hallucination pretraining dataset via the Gemini API.
Applies a three-stage training paradigm culminating in closed-loop, online reinforcement fine-tuning.
Optimizes the graph-based policy with the Group Relative Policy Optimization (GRPO) algorithm.
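The page does not reproduce the paper's training code, but GRPO's core idea, replacing a learned value critic with a group-relative baseline computed over several sampled rollouts for the same instruction, can be sketched as follows. This is a minimal illustration under stated assumptions: the reward values and the `grpo_advantages` name are illustrative, not the paper's actual reward design.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages a la GRPO: for a group of rollouts
    sampled from the same prompt (here, the same instruction), the
    advantage of each rollout is its reward normalized by the group's
    mean and standard deviation. No value network is needed."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: four hypothetical navigation rollouts for one instruction,
# each scored by some scalar reward (e.g. success / path efficiency).
adv = grpo_advantages([1.0, 0.0, 0.5, 0.5])
```

The resulting advantages sum to zero within the group, so better-than-average rollouts are reinforced and worse-than-average ones are penalized; in the full algorithm these advantages weight a clipped, PPO-style policy-gradient objective.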
Authors
Shuhao Ye — Zhejiang University, Hangzhou, China
Sitong Mao — Huawei; The Hong Kong Polytechnic University
Yuxiang Cui — Zhejiang Humanoid Robot Innovation Center, Ningbo, China
Xuan Yu — Zhejiang University, Hangzhou, China
Shichao Zhai — Zhejiang University, Hangzhou, China
Wen Chen — Huawei Technologies Co., Ltd.
Shunbo Zhou — Huawei; The Chinese University of Hong Kong
Rong Xiong — Zhejiang University
Yue Wang — Zhejiang University, Hangzhou, China