🤖 AI Summary
This study investigates the generalization capability of off-the-shelf large vision-language models (LVLMs) on vision-and-language navigation (VLN), specifically examining whether they can adapt, without architectural modification, to two heterogeneous action spaces: low-level egocentric atomic actions and high-level panoramic discrete viewpoints.
Method: We conduct the first systematic end-to-end comparison of a single LVLM (Qwen2.5-VL-3B-Instruct) on the R2R benchmark under both action paradigms, relying solely on instruction tuning for adaptation.
Contribution/Results: The best instruction-tuned model achieves a 41% success rate on the R2R test set, confirming that off-the-shelf LVLMs possess foundational VLN competence. However, performance remains substantially below that of task-specialized models, exposing limitations in action abstraction, spatiotemporal reasoning, and environment interaction. This work provides empirical evidence and a benchmarking framework for evaluating LVLMs’ transferability toward embodied intelligence.
📝 Abstract
Vision-and-Language Navigation (VLN) is the task of enabling autonomous robots to navigate unfamiliar environments by following natural language instructions. While recent Large Vision-Language Models (LVLMs) have shown promise on this task, most current VLN systems rely on models specifically designed and optimized for navigation, leaving the potential of off-the-shelf LVLMs underexplored. Furthermore, while older VLN approaches used low-level action spaces with egocentric views and atomic actions (such as "turn left" or "move forward"), newer models tend to favor panoramic action spaces with discrete navigable viewpoints. This paper investigates (1) whether off-the-shelf LVLMs (fine-tuned without architectural modifications or simulator-based training) can effectively support VLN tasks and (2) whether such models can support both low-level and panoramic action paradigms. To this end, we fine-tune the open-source model Qwen2.5-VL-3B-Instruct on the Room-to-Room (R2R) dataset and evaluate its empirical performance across both action spaces. The best resulting model achieves a 41% success rate on the R2R test set, demonstrating that while off-the-shelf LVLMs can learn to perform Vision-and-Language Navigation, they still lag behind models specifically designed for this task.
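To make the two action paradigms concrete, here is a minimal illustrative sketch of how an agent might decode an LVLM's free-text response under each paradigm. The paper does not specify its prompt or output format, so the action vocabulary, parser names, and fallback behavior below are all hypothetical assumptions, not the authors' implementation.

```python
from enum import Enum


# Hypothetical low-level atomic action vocabulary (egocentric paradigm).
class LowLevelAction(Enum):
    MOVE_FORWARD = "move forward"
    TURN_LEFT = "turn left"
    TURN_RIGHT = "turn right"
    STOP = "stop"


def parse_low_level(model_output: str) -> LowLevelAction:
    """Map a free-text model response onto one atomic action.

    In the low-level paradigm the agent repeatedly executes small
    egocentric motions until it decides to stop.
    """
    text = model_output.strip().lower()
    for action in LowLevelAction:
        if action.value in text:
            return action
    return LowLevelAction.STOP  # unparseable output: stop, end the episode


def parse_panoramic(model_output: str, num_candidates: int) -> int:
    """Map a model response onto a discrete candidate-viewpoint index.

    In the panoramic paradigm the agent instead picks one of the
    navigable viewpoints visible from its current location, so the
    model's answer is reduced to an index into that candidate list.
    """
    for token in model_output.split():
        if token.isdigit() and int(token) < num_candidates:
            return int(token)
    return 0  # unparseable output: default to the first candidate
```

The key structural difference the sketch highlights: the low-level space is fixed and tiny (a handful of motions), while the panoramic space varies per step with the number of navigable viewpoints, which changes what the model must learn to output.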