Vision-and-Language Navigation Today and Tomorrow: A Survey in the Era of Foundation Models

📅 2024-07-09
🏛️ arXiv.org
📈 Citations: 7
Influential: 0
🤖 AI Summary
This paper addresses three core challenges facing Vision-and-Language Navigation (VLN) in the foundation model era: weak embodied reasoning, poor cross-modal alignment, and inadequate long-horizon planning. To this end, it proposes the first foundation-model-ready analytical framework for VLN, systematically characterizing task requirements against the capabilities and limitations of large language models (LLMs) and vision-language models (VLMs). Methodologically, the framework integrates embodied AI, multimodal representation learning, instruction tuning, world models, and chain-of-thought reasoning to model how environmental understanding, action policy generation, and causal inference work together. Key contributions include: (1) establishing a paradigm centered on embodied planning and reasoning as the unifying theoretical foundation; (2) distilling critical milestones in VLN's evolution; and (3) identifying zero-shot generalization, simulation-to-reality transfer, and interpretable decision-making as three pivotal research directions. Together, these deliver the first systematic roadmap for interdisciplinary research at the intersection of VLN and foundation models.

📝 Abstract
Vision-and-Language Navigation (VLN) has gained increasing attention over recent years, and many approaches have emerged to advance its development. The remarkable achievements of foundation models have shaped the challenges and proposed methods for VLN research. In this survey, we provide a top-down review that adopts a principled framework for embodied planning and reasoning, and emphasizes the current methods and future opportunities leveraging foundation models to address VLN challenges. We hope our in-depth discussions can provide valuable resources and insights: on one hand, to mark milestones in the progress and explore opportunities and potential roles for foundation models in this field, and on the other, to organize the different challenges and solutions in VLN for foundation model researchers.
Problem

Research questions and friction points this paper is trying to address.

Vision-and-Language Navigation
Foundation Models
Future Research Opportunities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-and-Language Navigation
Foundation Models
Coherent Framework