Vision-and-Language Navigation with Analogical Textual Descriptions in LLMs

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing zero-shot vision-and-language navigation (VLN) agents suffer from a dual limitation: image-to-text scene descriptions often omit fine-grained visual details, while direct processing of raw images hinders high-level semantic reasoning. To address this, we propose Multi-view Analogical Captioning (MAC), a novel framework that leverages large language models to generate cross-view, abstraction-rich analogical captions—explicitly encoding spatial relations and contextual logic. MAC transforms visual inputs into structured, reasoning-enriched linguistic representations without introducing additional trainable parameters. Evaluated in a zero-shot, end-to-end setting on the R2R benchmark, MAC significantly improves path understanding and action decision accuracy, achieving a 5.2% absolute gain in task success rate over strong baselines. These results empirically validate the efficacy of analogical language representations for spatial reasoning and contextual modeling in VLN.
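The pipeline described above (caption each view, combine captions into a cross-view prompt, let an LLM pick an action) can be sketched roughly as follows. This is a minimal illustration assuming a generic captioner and LLM interface; the paper's actual prompts, models, and action space are not specified in this summary, and all function names here are hypothetical.

```python
def caption_view(view_id: int, raw_caption: str) -> str:
    """Stand-in for an image-to-text captioner; a real system would use a VLM."""
    return f"[view {view_id}] {raw_caption}"

def analogical_prompt(captions: list[str], instruction: str) -> str:
    """Combine per-view captions so an LLM can reason analogically across views."""
    views = "\n".join(captions)
    return (
        "Compare the candidate views below and relate them to the instruction.\n"
        f"Instruction: {instruction}\n"
        f"Views:\n{views}\n"
        "Answer with the index of the view to move toward."
    )

def choose_action(captions: list[str], instruction: str, llm=None) -> int:
    """Pick the next view to navigate toward, zero-shot (no trainable parameters)."""
    prompt = analogical_prompt(captions, instruction)
    if llm is not None:
        return llm(prompt)  # a real LLM call would parse the model's reply here
    # Fallback heuristic for illustration only: pick the view whose caption
    # shares the most words with the instruction.
    instr_words = set(instruction.lower().split())
    scores = [len(instr_words & set(c.lower().split())) for c in captions]
    return scores.index(max(scores))
```

For example, given captions for a kitchen, a hallway, and a bedroom, the instruction "walk down the hallway and go up the stairs" selects the hallway view; in the full framework, the LLM's cross-view comparison replaces the word-overlap heuristic.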

📝 Abstract
Integrating large language models (LLMs) into embodied AI models is becoming increasingly prevalent. However, existing zero-shot LLM-based Vision-and-Language Navigation (VLN) agents either encode images as textual scene descriptions, potentially oversimplifying visual details, or process raw image inputs, which can fail to capture abstract semantics required for high-level reasoning. In this paper, we improve the navigation agent's contextual understanding by incorporating textual descriptions from multiple perspectives that facilitate analogical reasoning across images. By leveraging text-based analogical reasoning, the agent enhances its global scene understanding and spatial reasoning, leading to more accurate action decisions. We evaluate our approach on the R2R dataset, where our experiments demonstrate significant improvements in navigation performance.
Problem

Research questions and friction points this paper is trying to address.

Textual scene descriptions can oversimplify fine-grained visual details
Raw image inputs can miss the abstract semantics needed for high-level reasoning
Zero-shot VLN agents lack the contextual and spatial reasoning needed for accurate action decisions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes Multi-view Analogical Captioning (MAC), integrating textual descriptions from multiple perspectives
Leverages LLM-based analogical reasoning across views to encode spatial relations and contextual logic
Improves global scene understanding and spatial reasoning without adding trainable parameters