Do Visual Imaginations Improve Vision-and-Language Navigation Agents?

📅 2025-03-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Can visual imagination improve the generalization of vision-and-language navigation (VLN) agents in unseen environments? This paper integrates text-to-image diffusion-generated visual imaginations into VLN agents as an added navigational modality. The approach segments navigation instructions to identify key visual concepts, particularly landmarks, and uses a diffusion model to synthesize corresponding landmark images. To strengthen cross-modal alignment, an auxiliary referring expression-image matching loss explicitly supervises language-vision grounding. Across benchmarks including R2R and CVDN, the method yields roughly a 1-point absolute gain in success rate (SR) and up to 0.5 points in SPL. These results suggest that generative visual imagination helps offset the perceptual limitations of purely language-driven navigation and reinforces multimodal grounding in embodied agents.

📝 Abstract
Vision-and-Language Navigation (VLN) agents are tasked with navigating an unseen environment using natural language instructions. In this work, we study if visual representations of sub-goals implied by the instructions can serve as navigational cues and lead to increased navigation performance. To synthesize these visual representations or imaginations, we leverage a text-to-image diffusion model on landmark references contained in segmented instructions. These imaginations are provided to VLN agents as an added modality to act as landmark cues and an auxiliary loss is added to explicitly encourage relating these with their corresponding referring expressions. Our findings reveal an increase in success rate (SR) of around 1 point and up to 0.5 points in success scaled by inverse path length (SPL) across agents. These results suggest that the proposed approach reinforces visual understanding compared to relying on language instructions alone. Code and data for our work can be found at https://www.akhilperincherry.com/VLN-Imagine-website/.
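The first stage of the pipeline described above segments the instruction into sub-goals and pulls out landmark references that a text-to-image model can then render. The paper's actual segmentation procedure is not detailed here, so the splitting rules and the `extract_landmarks` helper in the sketch below are illustrative assumptions, not the authors' implementation:

```python
import re

# Naive determiner + (up to two-word) noun phrase, e.g. "the kitchen table".
LANDMARK_PATTERN = re.compile(r"\b(?:the|a|an)\s+([a-z]+(?:\s+[a-z]+)?)")

def segment_instruction(instruction):
    """Split a navigation instruction into sub-goal segments (heuristic:
    break on punctuation and on the connectives 'then' / 'and')."""
    parts = re.split(r"[,.;]|\bthen\b|\band\b", instruction.lower())
    return [p.strip() for p in parts if p.strip()]

def extract_landmarks(segments):
    """Take the first determiner-led noun phrase in each segment as the
    landmark reference to hand to the text-to-image diffusion model."""
    landmarks = []
    for seg in segments:
        match = LANDMARK_PATTERN.search(seg)
        if match:
            landmarks.append(match.group(1))
    return landmarks
```

Each extracted phrase (e.g. "kitchen table") would then be used as a text prompt to synthesize one landmark imagination per sub-goal.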
Problem

Research questions and friction points this paper is trying to address.

Enhancing VLN agents with visual imaginations
Using text-to-image models for navigational cues
Improving navigation success rates with visual aids
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-to-image diffusion model for visual imaginations
Added modality for landmark cues in navigation
Auxiliary loss to relate imaginations with instructions
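The auxiliary loss is described only as encouraging the agent to relate imaginations to their referring expressions. One plausible instantiation is a CLIP-style symmetric contrastive (InfoNCE) loss over paired expression and imagination embeddings; the NumPy sketch below is an assumption about its form, not the paper's exact objective:

```python
import numpy as np

def matching_loss(text_emb, img_emb, temperature=0.07):
    """Symmetric contrastive loss pairing each referring-expression
    embedding with its generated landmark imagination embedding.
    Matched pairs sit on the diagonal of the similarity matrix."""
    # L2-normalize so the dot product is cosine similarity.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    logits = t @ v.T / temperature          # (N, N) similarity matrix
    labels = np.arange(len(t))              # ground-truth pairs: i <-> i

    def xent(lg):
        # Numerically stable cross-entropy toward the diagonal entries.
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the text->image and image->text directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimizing this loss pulls each imagination toward its referring expression in the shared embedding space while pushing apart mismatched pairs, which is one way to realize the explicit language-vision grounding the summary describes.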