Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-and-Language Navigation

📅 2025-12-08
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Current vision-language models (VLMs) for Vision-and-Language Navigation (VLN) typically employ end-to-end, short-horizon, discrete action mapping, resulting in jerky motion, high response latency, and poor adaptability to dynamic obstacles and long-horizon planning. To address these limitations, we propose DualVLN, the first dual-system VLN foundation model. Its System 2 ("slow") leverages a VLM for high-level semantic reasoning to generate mid-term waypoints, while its System 1 ("fast") employs a lightweight multimodal Diffusion Transformer that fuses pixel-level observations and latent states to produce smooth, real-time trajectories. This architecture decouples global path planning from local control, enabling millisecond-scale responsiveness without sacrificing generalization. Experiments demonstrate that DualVLN achieves state-of-the-art performance across all major VLN benchmarks and exhibits robust long-horizon planning and adaptive obstacle avoidance in realistic dynamic environments.
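The dual-system split summarized above, a slow VLM planner that re-grounds the instruction every few steps and a fast policy that emits a fresh local trajectory at every control step, can be sketched as a simple loop. This is a minimal sketch under assumptions: the class and method names (SlowPlanner, FastPolicy, control_loop), rates, and data shapes are hypothetical illustrations, not the paper's released code.

```python
import time
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Waypoint:
    u: float                                            # goal pixel column in the current RGB frame (assumed format)
    v: float                                            # goal pixel row
    latent: List[float] = field(default_factory=list)   # features handed from System 2 to System 1

class SlowPlanner:
    """System 2: VLM-based global planner, queried infrequently ("grounds slow")."""
    def plan(self, rgb_frame, instruction: str) -> Waypoint:
        # In the paper, a VLM grounds the instruction in the image and emits a
        # mid-term pixel-goal waypoint plus latent features; stubbed out here.
        return Waypoint(u=320.0, v=240.0, latent=[0.0] * 256)

class FastPolicy:
    """System 1: lightweight diffusion-transformer policy ("moves fast")."""
    def act(self, rgb_frame, goal: Waypoint) -> List[Tuple[float, float]]:
        # Denoises a short local trajectory conditioned on the pixel goal and
        # the planner's latent features; returns (x, y) poses in the robot frame.
        return [(0.1 * k, 0.0) for k in range(8)]

def control_loop(camera, robot, instruction: str, replan_every: int = 10) -> None:
    """Run the fast policy every step and re-query the slow planner periodically."""
    planner, policy = SlowPlanner(), FastPolicy()
    goal, step = None, 0
    while not robot.done():
        frame = camera.read()
        if goal is None or step % replan_every == 0:
            goal = planner.plan(frame, instruction)     # slow path: semantic grounding
        trajectory = policy.act(frame, goal)            # fast path: local trajectory
        robot.follow(trajectory)
        step += 1
        time.sleep(0.05)                                # ~20 Hz; the actual control rate is an assumption
```

The point of the sketch is the decoupling: the expensive VLM call sits outside the inner loop, so the local policy's latency budget does not depend on how long the planner takes to ground the instruction.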

📝 Abstract
While recent large vision-language models (VLMs) have improved generalization in vision-language navigation (VLN), existing methods typically rely on end-to-end pipelines that map vision-language inputs directly to short-horizon discrete actions. Such designs often produce fragmented motions, incur high latency, and struggle with real-world challenges like dynamic obstacle avoidance. We propose DualVLN, the first dual-system VLN foundation model that synergistically integrates high-level reasoning with low-level action execution. System 2, a VLM-based global planner, "grounds slowly" by predicting mid-term waypoint goals via image-grounded reasoning. System 1, a lightweight, multi-modal conditioning Diffusion Transformer policy, "moves fast" by leveraging both explicit pixel goals and latent features from System 2 to generate smooth and accurate trajectories. The dual-system design enables robust real-time control and adaptive local decision-making in complex, dynamic environments. By decoupling training, the VLM retains its generalization, while System 1 achieves interpretable and effective local navigation. DualVLN outperforms prior methods across all VLN benchmarks, and real-world experiments demonstrate robust long-horizon planning and real-time adaptability in dynamic environments.
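The abstract's point about decoupled training, keeping the pretrained VLM frozen so it retains its generalization while only System 1 is optimized, can be illustrated with a short PyTorch-style sketch. System1Policy, build_optimizer, and all dimensions below are hypothetical placeholders, not the paper's actual modules.

```python
import torch
import torch.nn as nn

class System1Policy(nn.Module):
    """Stand-in for the lightweight System 1 action head (not the paper's model)."""
    def __init__(self, latent_dim: int = 256, horizon: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 2, 256),   # latent features + (u, v) pixel goal
            nn.GELU(),
            nn.Linear(256, horizon * 2),      # flat (x, y) waypoints over the horizon
        )

    def forward(self, latent: torch.Tensor, pixel_goal: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([latent, pixel_goal], dim=-1))

def build_optimizer(vlm: nn.Module, policy: System1Policy, lr: float = 1e-4):
    """Freeze System 2 (the VLM) and optimize only the System 1 policy."""
    for p in vlm.parameters():
        p.requires_grad_(False)               # the VLM keeps its pretrained weights
    return torch.optim.AdamW(policy.parameters(), lr=lr)
```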
Problem

Research questions and friction points this paper is trying to address.

Existing VLN methods map vision-language inputs directly to short-horizon discrete actions through end-to-end pipelines, limiting generalization
Such designs produce fragmented motions and incur high response latency
Real-world deployment demands dynamic obstacle avoidance and robust long-horizon planning, which current pipelines handle poorly
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-system model integrates high-level reasoning with low-level execution
VLM-based planner predicts waypoint goals via image-grounded reasoning
Diffusion Transformer policy generates smooth trajectories from pixel goals and latent features (a sampling sketch follows this list)
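As a rough illustration of the multi-modal conditioning in the last bullet, the sketch below runs standard DDPM-style ancestral sampling over a short trajectory, concatenating System 2's latent features with the pixel goal into a single conditioning vector. The policy argument is any noise-prediction network; the noise schedule, shapes, and step count are assumptions, not values from the paper.

```python
import torch

@torch.no_grad()
def sample_trajectory(policy, latent: torch.Tensor, pixel_goal: torch.Tensor,
                      horizon: int = 8, steps: int = 50) -> torch.Tensor:
    """Denoise a (horizon, 2) local trajectory conditioned on latent + pixel goal."""
    betas = torch.linspace(1e-4, 0.02, steps)              # linear noise schedule (assumed)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    cond = torch.cat([latent, pixel_goal], dim=-1)          # multi-modal conditioning vector
    x = torch.randn(1, horizon, 2)                          # start from pure noise
    for t in reversed(range(steps)):
        eps = policy(x, torch.tensor([t]), cond)            # predicted noise at step t
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise             # DDPM posterior sample
    return x.squeeze(0)
```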
Meng Wei
Shanghai AI Laboratory
Chenyang Wan
Shanghai AI Laboratory
Jiaqi Peng
Shanghai AI Laboratory
Xiqian Yu
Shanghai AI Laboratory
Yuqiang Yang
Shanghai AI Laboratory
Delin Feng
Shanghai AI Laboratory
Wenzhe Cai
Shanghai AI Laboratory
Reinforcement Learning, Visual Navigation, Robotics
Chenming Zhu
The University of Hong Kong
Multimodal Large Language Model, 3D Vision
Tai Wang
Shanghai AI Laboratory
Computer Vision, 3D Vision, Embodied AI, Deep Learning
Jiangmiao Pang
Shanghai AI Laboratory
Xihui Liu
University of Hong Kong, UC Berkeley, CUHK, Tsinghua University
Computer Vision, Deep Learning