FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation

📅 2026-01-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses key limitations in existing vision-language navigation methods that employ chain-of-thought (CoT) reasoning—namely, weak spatial awareness, overfitting due to sparse annotations, and high latency from explicitly generating imagined visual representations. To overcome these challenges, the authors propose the first implicit multimodal CoT framework, which leverages a pretrained visual autoregressive model (VAR) to construct a compact latent space. During training, imagined visual states are encoded implicitly, while at inference time, instructions are directly mapped to actions without generating explicit visual tokens, thereby eliminating substantial computational overhead. The framework enables end-to-end joint learning that seamlessly integrates textual, visual, and multimodal reasoning pathways. Evaluated on the LH-VLN dataset, the method achieves significantly higher navigation success rates and reduces inference latency by an order of magnitude compared to explicit CoT approaches.
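The training/inference asymmetry described here can be sketched in toy form. Everything below is an illustrative assumption, not the paper's implementation: during training, the policy's internal latent is supervised against a frozen encoder (standing in for the pretrained VAR) applied to the imagined observation, alongside an action loss; at inference, no imagined visual tokens are generated and the instruction maps directly to an action.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions (not from the paper).
D_INSTR, D_LATENT, N_ACTIONS = 8, 4, 3

# Stand-in for the pretrained, frozen VAR encoder: maps an imagined
# visual observation to a compact latent vector.
W_var = rng.normal(size=(D_INSTR, D_LATENT))

def var_encode(imagined_obs):
    return imagined_obs @ W_var

# Trainable policy: instruction -> (latent reasoning state, action logits).
W_lat = rng.normal(size=(D_INSTR, D_LATENT))
W_act = rng.normal(size=(D_LATENT, N_ACTIONS))

def forward(instruction):
    latent = instruction @ W_lat   # implicit "imagined" state
    logits = latent @ W_act        # action head
    return latent, logits

def training_loss(instruction, imagined_obs, action_idx):
    # Training only: align the latent with the VAR encoding of the
    # imagined observation, plus a standard action cross-entropy.
    latent, logits = forward(instruction)
    align = np.mean((latent - var_encode(imagined_obs)) ** 2)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return align - np.log(probs[action_idx])

def act(instruction):
    # Inference: no imagined visual tokens are generated; the policy
    # maps the instruction straight to an action.
    _, logits = forward(instruction)
    return int(np.argmax(logits))

instr = rng.normal(size=D_INSTR)
imagined = rng.normal(size=D_INSTR)
loss = training_loss(instr, imagined, action_idx=1)
action = act(instr)
```

Because the alignment term appears only in the training loss, the inference path (`act`) never touches the VAR encoder, which is the source of the latency savings the summary describes.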

📝 Abstract
Achieving human-level performance in Vision-and-Language Navigation (VLN) requires an embodied agent to jointly understand multimodal instructions and visual-spatial context while reasoning over long action sequences. Recent works, such as NavCoT and NavGPT-2, demonstrate the potential of Chain-of-Thought (CoT) reasoning for improving interpretability and long-horizon planning. Moreover, multimodal extensions like OctoNav-R1 and CoT-VLA further validate CoT as a promising pathway toward human-like navigation reasoning. However, existing approaches face critical drawbacks: purely textual CoTs lack spatial grounding and easily overfit to sparse annotated reasoning steps, while multimodal CoTs incur severe token inflation by generating imagined visual observations, making real-time navigation impractical. In this work, we propose FantasyVLN, a unified implicit reasoning framework that preserves the benefits of CoT reasoning without explicit token overhead. Specifically, imagined visual tokens are encoded into a compact latent space using a pretrained Visual AutoRegressor (VAR) during CoT reasoning training, and the model jointly learns from textual, visual, and multimodal CoT modes under a unified multi-CoT strategy. At inference, our model performs direct instruction-to-action mapping while still enjoying reasoning-aware representations. Extensive experiments on LH-VLN show that our approach achieves reasoning-aware yet real-time navigation, improving success rates and efficiency while reducing inference latency by an order of magnitude compared to explicit CoT methods.
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Navigation
Chain-of-Thought Reasoning
Multimodal Reasoning
Token Inflation
Spatial Grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Thought Reasoning
Vision-Language Navigation
Multimodal Reasoning
Latent Space Compression
Real-time Navigation
Jing Zuo
Fantasy AIGC Team; Beijing University of Posts and Telecommunications
Lingzhou Mu
Tsinghua University
AIGC; video generation; AI security
Fan Jiang
Fantasy AIGC Team
Chengcheng Ma
Fantasy AIGC Team
Mu Xu
Fantasy AIGC Team
Yonggang Qi
Associate Professor, Beijing University of Posts and Telecommunications
computer vision; sketch-based vision learning algorithms and applications