🤖 AI Summary
Existing vision-and-language navigation (VLN) methods struggle to jointly model environmental dynamics and spatial structure under zero-shot settings, leading to insufficient robustness in long-horizon navigation. This work proposes PROSPECT, a unified streaming navigation agent that uses learnable flow-query tokens to efficiently predict next-step 2D/3D features in the latent spaces of frozen teacher models, thereby shaping internal representations without incurring additional inference overhead. PROSPECT is the first to integrate absolute-scale 3D spatial encodings (based on CUT3R) with SigLIP semantic features, employing cross-attention for multimodal fusion and conducting predictive representation learning directly in latent space. Evaluated on the VLN-CE benchmark and in real-world robotic deployment, PROSPECT achieves state-of-the-art performance and significantly improves navigation robustness under challenging lighting conditions over extended trajectories.
📝 Abstract
Multimodal large language models (MLLMs) have advanced zero-shot end-to-end Vision-Language Navigation (VLN), yet robust navigation requires not only semantic understanding but also predictive modeling of environment dynamics and spatial structure. We propose PROSPECT, a unified streaming navigation agent that couples a streaming Vision-Language-Action (VLA) policy with latent predictive representation learning. PROSPECT uses CUT3R as a streaming 3D foundation spatial encoder to produce long-context, absolute-scale spatial features, and fuses them with SigLIP semantic features via cross-attention. During training, we introduce learnable stream query tokens that query the streaming context and predict next-step 2D and 3D latent features (rather than pixels or explicit modalities), supervised in the latent spaces of frozen SigLIP and CUT3R teachers. The predictive branch shapes internal representations without inference overhead. Experiments on VLN-CE benchmarks and real-robot deployment demonstrate state-of-the-art performance and improved long-horizon robustness under diverse lighting. We will release code for the community soon.
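The two core mechanisms described above (cross-attention fusion of CUT3R-style spatial tokens with SigLIP-style semantic tokens, and learnable stream query tokens whose next-step 2D/3D predictions are supervised in frozen teacher latent spaces) can be sketched roughly as follows. This is a minimal illustrative PyTorch sketch, not the released implementation: the module and function names (`LatentPredictiveFusion`, `predictive_loss`), dimensions, and the cosine-style regression loss are all assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentPredictiveFusion(nn.Module):
    """Hypothetical sketch of the PROSPECT-style fusion and predictive branch:
    semantic tokens attend to spatial tokens, then learnable stream query
    tokens read the fused context and predict next-step 2D/3D latents."""

    def __init__(self, dim=512, n_heads=8, n_queries=4):
        super().__init__()
        # Cross-attention fusion: semantic queries, spatial keys/values.
        self.fuse = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Learnable stream query tokens (training-time predictive branch).
        self.stream_queries = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
        self.read = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Heads predicting the frozen teachers' next-step latent features.
        self.pred_2d = nn.Linear(dim, dim)  # toward the SigLIP latent space
        self.pred_3d = nn.Linear(dim, dim)  # toward the CUT3R latent space

    def forward(self, sem, spa):
        # sem: (B, N_sem, D) SigLIP-like semantic tokens
        # spa: (B, N_spa, D) CUT3R-like absolute-scale spatial tokens
        fused, _ = self.fuse(sem, spa, spa)
        fused = fused + sem                      # residual fusion
        q = self.stream_queries.expand(fused.size(0), -1, -1)
        ctx, _ = self.read(q, fused, fused)      # queries read the stream context
        ctx = ctx.mean(dim=1)                    # pool over query tokens
        return fused, self.pred_2d(ctx), self.pred_3d(ctx)


def predictive_loss(pred_2d, pred_3d, next_2d_teacher, next_3d_teacher):
    """Supervise predictions in the frozen teachers' latent spaces.
    A cosine-regression loss is assumed here; the paper's exact loss may differ."""
    l2d = 1 - F.cosine_similarity(pred_2d, next_2d_teacher, dim=-1).mean()
    l3d = 1 - F.cosine_similarity(pred_3d, next_3d_teacher, dim=-1).mean()
    return l2d + l3d
```

At inference the predictive heads and stream queries can simply be skipped, which is consistent with the claim that the predictive branch adds no inference overhead: only the fused representation feeds the navigation policy.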