🤖 AI Summary
This work addresses key limitations of existing vision-language navigation methods that employ chain-of-thought (CoT) reasoning: weak spatial awareness, overfitting to sparse annotations, and high latency from explicitly generating imagined visual representations. To overcome these challenges, the authors propose the first implicit multimodal CoT framework, which leverages a pretrained Visual AutoRegressor (VAR) to construct a compact latent space. During training, imagined visual states are encoded implicitly; at inference, instructions are mapped directly to actions without generating explicit visual tokens, eliminating the associated computational overhead. The framework supports end-to-end joint learning that integrates textual, visual, and multimodal reasoning pathways. Evaluated on the LH-VLN benchmark, the method achieves significantly higher navigation success rates and reduces inference latency by an order of magnitude compared with explicit CoT approaches.
📝 Abstract
Achieving human-level performance in Vision-and-Language Navigation (VLN) requires an embodied agent to jointly understand multimodal instructions and visual-spatial context while reasoning over long action sequences. Recent works, such as NavCoT and NavGPT-2, demonstrate the potential of Chain-of-Thought (CoT) reasoning for improving interpretability and long-horizon planning. Moreover, multimodal extensions like OctoNav-R1 and CoT-VLA further validate CoT as a promising pathway toward human-like navigation reasoning. However, existing approaches face critical drawbacks: purely textual CoTs lack spatial grounding and easily overfit to sparsely annotated reasoning steps, while multimodal CoTs incur severe token inflation by generating imagined visual observations, making real-time navigation impractical. In this work, we propose FantasyVLN, a unified implicit reasoning framework that preserves the benefits of CoT reasoning without explicit token overhead. Specifically, imagined visual tokens are encoded into a compact latent space by a pretrained Visual AutoRegressor (VAR) during CoT training, and the model jointly learns from textual, visual, and multimodal CoT modes under a unified multi-CoT strategy. At inference, our model performs direct instruction-to-action mapping while still benefiting from reasoning-aware representations. Extensive experiments on LH-VLN show that our approach achieves reasoning-aware yet real-time navigation, improving success rates while reducing inference latency by an order of magnitude compared to explicit CoT methods.
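To make the implicit-CoT idea concrete, here is a minimal, framework-free sketch of the training/inference asymmetry the abstract describes. All function names, shapes, and losses are illustrative assumptions, not the paper's actual implementation: a frozen stand-in "VAR encoder" compresses imagined visual tokens into a short latent, the policy's hidden state is trained to align with that latent (so reasoning is absorbed implicitly), and inference skips the VAR branch entirely, mapping instructions directly to actions.

```python
# Hypothetical sketch of implicit multimodal CoT (all names/losses are assumptions).

def var_encode(imagined_tokens):
    """Stand-in for a pretrained, frozen Visual AutoRegressor encoder:
    compresses a long imagined-token sequence into a compact 2-d latent."""
    n = len(imagined_tokens)
    return [sum(imagined_tokens) / n, max(imagined_tokens) - min(imagined_tokens)]

def policy(instruction_feats):
    """Stand-in policy: maps instruction features to (action_logits, hidden).
    The hidden state is what training pushes toward the VAR latent."""
    hidden = [f * 0.5 for f in instruction_feats[:2]]
    action_logits = [sum(hidden), -sum(hidden)]
    return action_logits, hidden

def train_step(instruction_feats, imagined_tokens, gold_action):
    """Training sees imagined visual tokens, but only through the latent:
    an alignment loss ties the policy's hidden state to the VAR latent,
    alongside a (toy) action loss."""
    logits, hidden = policy(instruction_feats)
    latent = var_encode(imagined_tokens)            # implicit CoT target
    align_loss = sum((h - z) ** 2 for h, z in zip(hidden, latent))
    action_loss = -logits[gold_action]              # toy surrogate objective
    return action_loss + align_loss

def infer(instruction_feats):
    """Inference never calls var_encode and generates no imagined tokens:
    direct instruction-to-action mapping, hence the latency savings."""
    logits, _ = policy(instruction_feats)
    return max(range(len(logits)), key=lambda a: logits[a])
```

The key design point this sketch isolates is that the VAR appears only inside `train_step`: the cost of imagining visual states is paid during training, while `infer` runs the same policy with no extra token generation.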