P$^{3}$Nav: End-to-End Perception, Prediction and Planning for Vision-and-Language Navigation

📅 2026-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes P$^{3}$Nav, an end-to-end unified framework for vision-and-language navigation that addresses a key limitation of existing methods, which often fail to plan effective paths because of insufficient scene understanding. By tightly integrating perception, prediction, and planning, the framework enhances environmental comprehension through multi-granularity visual perception, models the agent's potential future states via waypoint prediction, and forecasts semantic map cues conditioned on the anticipated waypoints, enabling proactive, forward-looking decision-making instead of passive planning that relies solely on historical context. Extensive experiments report new state-of-the-art navigation success rates on the REVERIE, R2R-CE, and RxR-CE benchmarks.

📝 Abstract
In Vision-and-Language Navigation (VLN), an agent must plan a path to the target specified by a language instruction using its visual observations. Consequently, prevailing VLN methods focus primarily on building powerful planners through visual-textual alignment. However, these approaches often overlook the need for comprehensive scene understanding prior to planning, leaving the agent with insufficient perception and prediction capabilities. We therefore propose P$^{3}$Nav, a novel end-to-end framework that integrates perception, prediction, and planning in a unified pipeline to strengthen the VLN agent's scene understanding and boost navigation success. Specifically, P$^{3}$Nav augments perception by extracting complementary cues from object-level and map-level perspectives. It then predicts waypoints to model the agent's potential future states, endowing the agent with intrinsic awareness of candidate positions during navigation. Conditioned on these future waypoints, P$^{3}$Nav further forecasts semantic map cues, enabling proactive planning and reducing the strict reliance on purely historical context. Integrating these perceptual and predictive cues, a holistic planning module finally carries out the navigation task. Extensive experiments demonstrate that P$^{3}$Nav achieves new state-of-the-art performance on the REVERIE, R2R-CE, and RxR-CE benchmarks.
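
The abstract describes a three-stage dataflow: fuse object-level and map-level perception, predict future waypoints, forecast semantic-map cues conditioned on those waypoints, and plan actions from all of these signals. Below is a minimal PyTorch sketch of that dataflow only; every module name, tensor shape, and layer choice is an illustrative assumption, not the authors' implementation.

```python
# Minimal sketch of the perception -> prediction -> planning dataflow from the
# abstract. All dimensions and module choices are illustrative assumptions.
import torch
import torch.nn as nn


class P3NavSketch(nn.Module):
    def __init__(self, feat_dim=256, num_waypoints=4, map_dim=64, num_actions=6):
        super().__init__()
        # Perception: fuse complementary object-level and map-level cues.
        self.object_enc = nn.Linear(feat_dim, feat_dim)
        self.map_enc = nn.Linear(feat_dim, feat_dim)
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)
        # Prediction: regress future waypoints, then forecast semantic-map cues
        # conditioned on those waypoints.
        self.waypoint_head = nn.Linear(feat_dim, num_waypoints * 2)  # (x, y) per waypoint
        self.map_forecast = nn.Linear(feat_dim + num_waypoints * 2, map_dim)
        # Planning: score actions from instruction, perception, and predictions.
        self.instr_enc = nn.Linear(feat_dim, feat_dim)
        self.planner = nn.Sequential(
            nn.Linear(2 * feat_dim + map_dim, feat_dim),
            nn.ReLU(),
            nn.Linear(feat_dim, num_actions),
        )

    def forward(self, object_feats, map_feats, instr_feats):
        # 1) Perception: complementary object- and map-level features.
        percept = self.fuse(torch.cat(
            [self.object_enc(object_feats), self.map_enc(map_feats)], dim=-1))
        # 2) Prediction: future waypoints, then semantic-map forecast.
        waypoints = self.waypoint_head(percept)
        forecast = self.map_forecast(torch.cat([percept, waypoints], dim=-1))
        # 3) Planning: combine instruction, perception, and forecasts into action logits.
        instr = self.instr_enc(instr_feats)
        logits = self.planner(torch.cat([instr, percept, forecast], dim=-1))
        return logits, waypoints


# Usage with dummy features (batch of 2 observations).
model = P3NavSketch()
obj, mp, ins = torch.randn(2, 256), torch.randn(2, 256), torch.randn(2, 256)
logits, waypoints = model(obj, mp, ins)
print(logits.shape, waypoints.shape)  # torch.Size([2, 6]) torch.Size([2, 8])
```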
Problem

Research questions and friction points this paper is trying to address.

Vision-and-Language Navigation
scene understanding
perception
prediction
planning
Innovation

Methods, ideas, or system contributions that make the work stand out.

end-to-end navigation
perception-prediction-planning integration
waypoint prediction
semantic map forecasting
vision-and-language navigation