🤖 AI Summary
This study investigates the generalization capability of off-the-shelf large vision-language models (LVLMs) on vision-and-language navigation (VLN), specifically examining whether they can adapt, without architectural modification, to two heterogeneous action spaces: low-level egocentric atomic actions and high-level panoramic discrete viewpoints.
Method: We conduct the first systematic end-to-end comparison of a single LVLM (Qwen2.5-VL-3B-Instruct) on the R2R benchmark under both action paradigms, relying solely on instruction tuning for adaptation.
Contribution/Results: The best instruction-tuned model achieves a 41% success rate on the R2R test set, confirming that off-the-shelf LVLMs possess foundational VLN competence. However, performance remains substantially below that of task-specialized models, exposing limitations in action abstraction, spatiotemporal reasoning, and environment interaction. This work provides empirical evidence and a benchmarking framework for evaluating LVLMs’ transferability toward embodied intelligence.
📝 Abstract
Vision-and-Language Navigation (VLN) is the task of enabling autonomous robots to navigate unfamiliar environments by following natural language instructions. While recent Large Vision-Language Models (LVLMs) have shown promise on this task, most current VLN systems rely on models specifically designed and optimized for navigation, leaving the potential of off-the-shelf LVLMs underexplored. Furthermore, while older VLN approaches used low-level action spaces with egocentric views and atomic actions (such as "turn left" or "move forward"), newer models tend to favor panoramic action spaces with discrete navigable viewpoints. This paper investigates (1) whether off-the-shelf LVLMs (fine-tuned without architectural modifications or simulator-based training) can effectively support VLN tasks and (2) whether such models can support both low-level and panoramic action paradigms. To this end, we fine-tune the open-source model Qwen2.5-VL-3B-Instruct on the Room-to-Room (R2R) dataset and evaluate its empirical performance across both action spaces. The best resulting model achieves a 41% success rate on the R2R test set, demonstrating that while off-the-shelf LVLMs can learn to perform Vision-and-Language Navigation, they still lag behind models specifically designed for this task.
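To make the two action paradigms concrete, here is a minimal illustrative sketch of how an agent might decode an LVLM's free-text response under each paradigm. The paper does not specify its prompt or output format, so the action vocabulary, parser names, and fallback behavior below are all hypothetical assumptions, not the authors' implementation.

```python
from enum import Enum


# Hypothetical low-level atomic action vocabulary (egocentric paradigm).
class LowLevelAction(Enum):
    MOVE_FORWARD = "move forward"
    TURN_LEFT = "turn left"
    TURN_RIGHT = "turn right"
    STOP = "stop"


def parse_low_level(model_output: str) -> LowLevelAction:
    """Map a free-text model response onto one atomic action.

    In the low-level paradigm the agent repeatedly executes small
    egocentric motions until it decides to stop.
    """
    text = model_output.strip().lower()
    for action in LowLevelAction:
        if action.value in text:
            return action
    return LowLevelAction.STOP  # unparseable output: stop, end the episode


def parse_panoramic(model_output: str, num_candidates: int) -> int:
    """Map a model response onto a discrete candidate-viewpoint index.

    In the panoramic paradigm the agent instead picks one of the
    navigable viewpoints visible from its current location, so the
    model's answer is reduced to an index into that candidate list.
    """
    for token in model_output.split():
        if token.isdigit() and int(token) < num_candidates:
            return int(token)
    return 0  # unparseable output: default to the first candidate
```

The key structural difference the sketch highlights: the low-level space is fixed and tiny (a handful of motions), while the panoramic space varies per step with the number of navigable viewpoints, which changes what the model must learn to output.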