Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-and-Language Navigation

📅 2025-12-08
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Current vision-language models (VLMs) for Vision-and-Language Navigation (VLN) typically employ end-to-end, short-horizon, discrete action mapping, resulting in jerky motion, high response latency, and poor adaptability to dynamic obstacles and long-horizon planning. To address these limitations, we propose DualVLN, the first dual-system VLN foundation model. Its System 2 ("slow") leverages a VLM for high-level semantic reasoning to generate mid-term waypoints, while its System 1 ("fast") employs a lightweight multimodal Diffusion Transformer that fuses pixel-level observations and latent states to produce smooth, real-time trajectories. This architecture decouples global path planning from local control, enabling millisecond-scale responsiveness without sacrificing generalization. Experiments demonstrate that DualVLN achieves state-of-the-art performance across all major VLN benchmarks and exhibits robust long-horizon planning and adaptive obstacle avoidance in realistic dynamic environments.
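The dual-system split summarized above, a slow VLM planner that re-grounds the instruction every few steps and a fast policy that emits a fresh local trajectory at every control step, can be sketched as a simple loop. This is a minimal sketch under assumptions: the class and method names (SlowPlanner, FastPolicy, control_loop), rates, and data shapes are hypothetical illustrations, not the paper's released code.

```python
import time
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Waypoint:
    u: float                                            # goal pixel column in the current RGB frame (assumed format)
    v: float                                            # goal pixel row
    latent: List[float] = field(default_factory=list)   # features handed from System 2 to System 1

class SlowPlanner:
    """System 2: VLM-based global planner, queried infrequently ("grounds slow")."""
    def plan(self, rgb_frame, instruction: str) -> Waypoint:
        # In the paper, a VLM grounds the instruction in the image and emits a
        # mid-term pixel-goal waypoint plus latent features; stubbed out here.
        return Waypoint(u=320.0, v=240.0, latent=[0.0] * 256)

class FastPolicy:
    """System 1: lightweight diffusion-transformer policy ("moves fast")."""
    def act(self, rgb_frame, goal: Waypoint) -> List[Tuple[float, float]]:
        # Denoises a short local trajectory conditioned on the pixel goal and
        # the planner's latent features; returns (x, y) poses in the robot frame.
        return [(0.1 * k, 0.0) for k in range(8)]

def control_loop(camera, robot, instruction: str, replan_every: int = 10) -> None:
    """Run the fast policy every step and re-query the slow planner periodically."""
    planner, policy = SlowPlanner(), FastPolicy()
    goal, step = None, 0
    while not robot.done():
        frame = camera.read()
        if goal is None or step % replan_every == 0:
            goal = planner.plan(frame, instruction)     # slow path: semantic grounding
        trajectory = policy.act(frame, goal)            # fast path: local trajectory
        robot.follow(trajectory)
        step += 1
        time.sleep(0.05)                                # ~20 Hz; the actual control rate is an assumption
```

The point of the sketch is the decoupling: the expensive VLM call sits outside the inner loop, so the local policy's latency budget does not depend on how long the planner takes to ground the instruction.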

📝 Abstract
While recent large vision-language models (VLMs) have improved generalization in vision-language navigation (VLN), existing methods typically rely on end-to-end pipelines that map vision-language inputs directly to short-horizon discrete actions. Such designs often produce fragmented motions, incur high latency, and struggle with real-world challenges like dynamic obstacle avoidance. We propose DualVLN, the first dual-system VLN foundation model that synergistically integrates high-level reasoning with low-level action execution. System 2, a VLM-based global planner, "grounds slowly" by predicting mid-term waypoint goals via image-grounded reasoning. System 1, a lightweight, multi-modal conditioning Diffusion Transformer policy, "moves fast" by leveraging both explicit pixel goals and latent features from System 2 to generate smooth and accurate trajectories. The dual-system design enables robust real-time control and adaptive local decision-making in complex, dynamic environments. By decoupling training, the VLM retains its generalization, while System 1 achieves interpretable and effective local navigation. DualVLN outperforms prior methods across all VLN benchmarks, and real-world experiments demonstrate robust long-horizon planning and real-time adaptability in dynamic environments.
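The abstract's point about decoupled training, keeping the pretrained VLM frozen so it retains its generalization while only System 1 is optimized, can be illustrated with a short PyTorch-style sketch. System1Policy, build_optimizer, and all dimensions below are hypothetical placeholders, not the paper's actual modules.

```python
import torch
import torch.nn as nn

class System1Policy(nn.Module):
    """Stand-in for the lightweight System 1 action head (not the paper's model)."""
    def __init__(self, latent_dim: int = 256, horizon: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 2, 256),   # latent features + (u, v) pixel goal
            nn.GELU(),
            nn.Linear(256, horizon * 2),      # flat (x, y) waypoints over the horizon
        )

    def forward(self, latent: torch.Tensor, pixel_goal: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([latent, pixel_goal], dim=-1))

def build_optimizer(vlm: nn.Module, policy: System1Policy, lr: float = 1e-4):
    """Freeze System 2 (the VLM) and optimize only the System 1 policy."""
    for p in vlm.parameters():
        p.requires_grad_(False)               # the VLM keeps its pretrained weights
    return torch.optim.AdamW(policy.parameters(), lr=lr)
```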
Problem

Research questions and friction points this paper is trying to address.

Existing VLN methods map vision-language inputs directly to short-horizon discrete actions through end-to-end pipelines, limiting generalization
Such designs produce fragmented motions and incur high response latency
Real-world deployment demands dynamic obstacle avoidance and robust long-horizon planning, which current pipelines handle poorly
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-system model integrates high-level reasoning with low-level execution
VLM-based planner predicts waypoint goals via image-grounded reasoning
Diffusion Transformer policy generates smooth trajectories from pixel goals and latent features (a sampling sketch follows this list)
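As a rough illustration of the multi-modal conditioning in the last bullet, the sketch below runs standard DDPM-style ancestral sampling over a short trajectory, concatenating System 2's latent features with the pixel goal into a single conditioning vector. The policy argument is any noise-prediction network; the noise schedule, shapes, and step count are assumptions, not values from the paper.

```python
import torch

@torch.no_grad()
def sample_trajectory(policy, latent: torch.Tensor, pixel_goal: torch.Tensor,
                      horizon: int = 8, steps: int = 50) -> torch.Tensor:
    """Denoise a (horizon, 2) local trajectory conditioned on latent + pixel goal."""
    betas = torch.linspace(1e-4, 0.02, steps)              # linear noise schedule (assumed)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    cond = torch.cat([latent, pixel_goal], dim=-1)          # multi-modal conditioning vector
    x = torch.randn(1, horizon, 2)                          # start from pure noise
    for t in reversed(range(steps)):
        eps = policy(x, torch.tensor([t]), cond)            # predicted noise at step t
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise             # DDPM posterior sample
    return x.squeeze(0)
```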
Meng Wei
Shanghai AI Laboratory
Chenyang Wan
Shanghai AI Laboratory
Jiaqi Peng
Shanghai AI Laboratory
Xiqian Yu
Shanghai AI Laboratory
Yuqiang Yang
Shanghai AI Laboratory
Delin Feng
Shanghai AI Laboratory
Wenzhe Cai
Shanghai AI Laboratory
Reinforcement Learning, Visual Navigation, Robotics
Chenming Zhu
The University of Hong Kong
Multimodal Large Language Model, 3D Vision
Tai Wang
Shanghai AI Laboratory
Computer Vision, 3D Vision, Embodied AI, Deep Learning
Jiangmiao Pang
Shanghai AI Laboratory
Xihui Liu
University of Hong Kong, UC Berkeley, CUHK, Tsinghua University
Computer Vision, Deep Learning