PROSPECT: Unified Streaming Vision-Language Navigation via Semantic--Spatial Fusion and Latent Predictive Representation

📅 2026-03-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-and-language navigation (VLN) methods struggle to jointly model environment dynamics and spatial structure in zero-shot settings, leaving them insufficiently robust over long-horizon navigation. This work proposes PROSPECT, a unified streaming navigation agent that uses learnable stream query tokens to predict next-step 2D/3D features in the latent spaces of frozen teacher models, shaping internal representations without adding inference overhead. PROSPECT is the first to integrate absolute-scale 3D spatial encodings (from CUT3R) with SigLIP semantic features, employing cross-attention for multimodal fusion and performing predictive representation learning directly in latent space. Evaluated on VLN-CE benchmarks and in real-world robotic deployment, PROSPECT achieves state-of-the-art performance and markedly improves navigation robustness under challenging lighting conditions and over extended trajectories.
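The cross-attention fusion described above can be sketched minimally in NumPy: semantic tokens act as queries and spatial tokens supply keys and values, with a residual connection back onto the semantic stream. This is an illustrative sketch, not the paper's implementation; the token counts, feature dimensions, and random projection weights are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_semantic_spatial(sem_tokens, spa_tokens, d_k=64, seed=0):
    """Cross-attention fusion sketch: semantic tokens (queries) attend to
    spatial tokens (keys/values). Projection weights are random stand-ins."""
    rng = np.random.default_rng(seed)
    d_sem = sem_tokens.shape[-1]
    d_spa = spa_tokens.shape[-1]
    Wq = rng.standard_normal((d_sem, d_k)) / np.sqrt(d_sem)
    Wk = rng.standard_normal((d_spa, d_k)) / np.sqrt(d_spa)
    Wv = rng.standard_normal((d_spa, d_sem)) / np.sqrt(d_spa)
    Q = sem_tokens @ Wq                        # (N_sem, d_k)
    K = spa_tokens @ Wk                        # (N_spa, d_k)
    V = spa_tokens @ Wv                        # (N_spa, d_sem)
    attn = softmax(Q @ K.T / np.sqrt(d_k))     # (N_sem, N_spa)
    return sem_tokens + attn @ V               # residual fusion

# Hypothetical shapes: 196 SigLIP-like patch tokens (768-d) fused with
# 196 CUT3R-like spatial tokens (1024-d).
sem = np.random.default_rng(1).standard_normal((196, 768))
spa = np.random.default_rng(2).standard_normal((196, 1024))
fused = fuse_semantic_spatial(sem, spa)
print(fused.shape)  # (196, 768)
```

The fused output keeps the semantic stream's dimensionality, so it can drop into the downstream policy wherever the SigLIP features were consumed.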

📝 Abstract
Multimodal large language models (MLLMs) have advanced zero-shot end-to-end Vision-Language Navigation (VLN), yet robust navigation requires not only semantic understanding but also predictive modeling of environment dynamics and spatial structure. We propose PROSPECT, a unified streaming navigation agent that couples a streaming Vision-Language-Action (VLA) policy with latent predictive representation learning. PROSPECT uses CUT3R as a streaming 3D foundation spatial encoder to produce long-context, absolute-scale spatial features, and fuses them with SigLIP semantic features via cross-attention. During training, we introduce learnable stream query tokens that query the streaming context and predict next-step 2D and 3D latent features (rather than pixels or explicit modalities), supervised in the latent spaces of frozen SigLIP and CUT3R teachers. The predictive branch shapes internal representations without inference overhead. Experiments on VLN-CE benchmarks and real-robot deployment demonstrate state-of-the-art performance and improved long-horizon robustness under diverse lighting. We will release code for the community soon.
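The latent predictive branch, as described in the abstract, supervises query-token predictions against targets from frozen SigLIP and CUT3R teachers rather than against pixels. A minimal sketch of such a loss, assuming a cosine-similarity objective in the teacher latent space (the paper does not specify the exact loss form, so this choice is an assumption):

```python
import numpy as np

def cosine_latent_loss(pred, target):
    """Cosine-distance loss between predicted next-step latents and
    frozen-teacher targets; targets are plain arrays here, standing in
    for detached (no-gradient) teacher outputs."""
    p = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    t = target / np.linalg.norm(target, axis=-1, keepdims=True)
    return float(1.0 - (p * t).sum(axis=-1).mean())

rng = np.random.default_rng(0)
# Hypothetical shapes: 4 stream query tokens predicting 768-d SigLIP
# latents and 1024-d CUT3R latents for the next step.
pred_2d = rng.standard_normal((4, 768))
pred_3d = rng.standard_normal((4, 1024))
target_2d = rng.standard_normal((4, 768))    # frozen SigLIP teacher output
target_3d = rng.standard_normal((4, 1024))   # frozen CUT3R teacher output

total_loss = cosine_latent_loss(pred_2d, target_2d) \
           + cosine_latent_loss(pred_3d, target_3d)
```

Because the teachers are frozen and the predictive branch is dropped at test time, this objective shapes the agent's internal representations during training without adding inference cost.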
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Navigation
spatial structure
environment dynamics
predictive modeling
semantic understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

streaming vision-language navigation
semantic-spatial fusion
latent predictive representation
foundation spatial encoder
cross-attention
Zehua Fan
Shanghai Jiao Tong University, Shanghai, China
Wenqi Lyu
The University of Adelaide
Embodied AI
Wenxuan Song
The Hong Kong University of Science and Technology (Guangzhou)
Vision-Language-Action Model · Robotics
Linge Zhao
Wuhan University, Wuhan, China
Yifei Yang
Shanghai Jiao Tong University
Natural Language Processing
Xi Wang
AIR Wuxi Innovation Center, Tsinghua University, Wuxi, China
Junjie He
Guizhou University
MRI · Deep Learning · CT
Lida Huang
Tsinghua University, Beijing, China
Haiyan Liu
Lenovo, Beijing, China
Bingchuan Sun
Lenovo, Beijing, China
Guangjun Bao
Lenovo, Beijing, China
Xuanyao Mao
Lenovo, Beijing, China
Liang Xu
Lenovo, Beijing, China
Yan Wang
Tsinghua University; SenseTime
Neural Compression · Computer Vision · Machine Learning
Feng Gao
Ocean University of China
Hyperspectral Image Processing · Artificial Intelligence Oceanography