StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling

📅 2025-07-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of continuous visual-stream understanding and low-latency action generation in real-world vision-and-language navigation (VLN), this paper proposes a streaming multimodal inference framework. The core innovation is a slow-fast dual-context architecture: a fast-streaming dialogue context processes recent frames through a sliding window of active dialogues for low-latency action generation, while a slow-updating memory context compresses long-horizon visual history via 3D-aware token pruning, enabling efficient KV cache reuse. This design jointly optimizes fine-grained visual perception, long-range dependency modeling, and inference efficiency under constrained computational budgets. The framework builds on video large language models and supports interleaved processing of visual, linguistic, and action inputs. Evaluated on VLN-CE benchmarks, it achieves state-of-the-art performance while significantly improving inference efficiency and deployment robustness for long-video-stream navigation.
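
To make the slow-fast split concrete, here is a minimal Python sketch of the context bookkeeping described above. It is an illustration under stated assumptions, not the paper's implementation: all names (`StreamContext`, `prune_tokens`, `MAX_ACTIVE_TURNS`, `MEMORY_BUDGET`) are hypothetical, and the naive truncation stands in for the actual 3D-aware pruning.

```python
from collections import deque

# Minimal sketch of slow-fast context bookkeeping. All names here are
# hypothetical; the naive truncation stands in for 3D-aware token pruning.

MAX_ACTIVE_TURNS = 8   # size of the fast sliding dialogue window (assumed)
MEMORY_BUDGET = 256    # token budget for the slow, compressed memory (assumed)

def prune_tokens(frame_tokens, budget):
    """Stand-in for the paper's 3D-aware token pruning: keep only a bounded
    subset of a frame's visual tokens (here, naive truncation)."""
    return frame_tokens[:budget]

class StreamContext:
    def __init__(self):
        # Fast context: the most recent dialogue turns, kept verbatim.
        self.fast_window = deque(maxlen=MAX_ACTIVE_TURNS)
        # Slow context: compressed visual history of evicted turns.
        self.slow_memory = []

    def step(self, frame_tokens, dialogue_turn):
        if len(self.fast_window) == self.fast_window.maxlen:
            # The oldest turn leaves the fast window; its visual tokens are
            # compressed into slow memory rather than discarded.
            evicted = self.fast_window[0]
            self.slow_memory.extend(
                prune_tokens(evicted["frame"], MEMORY_BUDGET // MAX_ACTIVE_TURNS)
            )
        self.fast_window.append({"frame": frame_tokens, "turn": dialogue_turn})

    def context_tokens(self):
        # Bounded prompt: compressed memory plus the active window, so the
        # reusable KV-cache prefix never grows without limit.
        active = [tok for turn in self.fast_window for tok in turn["frame"]]
        return self.slow_memory[-MEMORY_BUDGET:] + active
```

The point of the design is that `context_tokens()` stays bounded regardless of episode length, which is what makes KV cache reuse and stable per-step latency possible.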

📝 Abstract
Vision-and-Language Navigation (VLN) in real-world settings requires agents to process continuous visual streams and generate actions with low latency, grounded in language instructions. While Video-based Large Language Models (Video-LLMs) have driven recent progress, current Video-LLM-based VLN methods often face trade-offs among fine-grained visual understanding, long-term context modeling, and computational efficiency. We introduce StreamVLN, a streaming VLN framework that employs a hybrid slow-fast context modeling strategy to support multi-modal reasoning over interleaved vision, language, and action inputs. The fast-streaming dialogue context facilitates responsive action generation through a sliding window of active dialogues, while the slow-updating memory context compresses historical visual states using a 3D-aware token pruning strategy. With this slow-fast design, StreamVLN achieves coherent multi-turn dialogue through efficient KV cache reuse, supporting long video streams with bounded context size and inference cost. Experiments on VLN-CE benchmarks demonstrate state-of-the-art performance with stable low latency, ensuring robustness and efficiency in real-world deployment. The project page is: https://streamvln.github.io/.
Problem

Research questions and friction points this paper is trying to address.

Balancing fine-grained visual understanding and computational efficiency in VLN
Enabling low-latency action generation from continuous visual and language inputs
Managing long-term context modeling with bounded computational resources
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid slow-fast context modeling strategy
3D-aware token pruning for visual compression
Efficient KV cache reuse for low latency (see the sketch below)
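
The last item above is the main efficiency lever. Below is a minimal sketch of per-turn KV cache reuse, assuming a HuggingFace-style causal LM whose forward pass accepts `past_key_values` and `use_cache`; the `CachedDecoder` and `decode_turn` names are hypothetical, not the paper's code. Because the instruction and compressed memory form a stable prefix, their keys/values are computed once and reused, so each new turn only pays for its own tokens.

```python
import torch

# Minimal sketch of per-turn KV cache reuse, assuming a HuggingFace-style
# causal LM. Class and method names are hypothetical, not the paper's code.

class CachedDecoder:
    def __init__(self, model):
        self.model = model
        self.past_key_values = None  # cache for the stable context prefix

    @torch.no_grad()
    def decode_turn(self, new_token_ids):
        # Only the new turn's tokens are fed forward; keys/values for the
        # shared prefix (instruction + compressed memory + earlier turns)
        # come from the cache, keeping per-turn latency roughly constant.
        out = self.model(
            input_ids=new_token_ids,
            past_key_values=self.past_key_values,
            use_cache=True,
        )
        self.past_key_values = out.past_key_values
        return out.logits[:, -1]  # distribution over the next action token
```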
👥 Authors
Meng Wei (Shanghai AI Laboratory)
Chenyang Wan (Shanghai AI Laboratory, Zhejiang University)
Xiqian Yu (Shanghai AI Laboratory)
Tai Wang (Shanghai AI Laboratory) · Computer Vision, 3D Vision, Embodied AI, Deep Learning
Yuqiang Yang (Shanghai AI Laboratory)
Xiaohan Mao (Shanghai AI Laboratory, Shanghai Jiao Tong University)
Chenming Zhu (The University of Hong Kong) · Multimodal Large Language Model, 3D Vision
Wenzhe Cai (Shanghai AI Laboratory) · Reinforcement Learning, Visual Navigation, Robotics
Hanqing Wang (Shanghai AI Laboratory)
Yilun Chen (Shanghai AI Laboratory)
Xihui Liu (University of Hong Kong, UC Berkeley, CUHK, Tsinghua University) · Computer Vision, Deep Learning
Jiangmiao Pang (Shanghai AI Laboratory)