🤖 AI Summary
Vision-Language Navigation (VLN) suffers from redundant scene representations and ambiguous vision-language alignment, which deprive agents of high-level semantic priors and degrade instruction following. To address these issues, we propose a recursive visual imagination mechanism and an adaptive linguistic grounding technique. The former models historical trajectories as a compact neural grid and recursively generates high-level visual representations that capture cross-step visual transition patterns. The latter dynamically attends to semantically critical instruction tokens, enabling fine-grained vision-language semantic matching while suppressing interference from low-level geometric details. Our approach achieves significant improvements over state-of-the-art methods on the VLN-CE and ObjectNav benchmarks, substantially boosting navigation accuracy and instruction consistency. These results empirically validate the critical role of explicit high-level scene-prior modeling in enabling robust, semantically grounded navigation.
📝 Abstract
Vision-Language Navigation (VLN) requires agents to navigate to specified objects or remote regions in unknown scenes by following linguistic commands. Such tasks demand organizing historical visual observations for linguistic grounding, which is critical for long-horizon navigation decisions. However, current agents suffer from overly detailed scene representations and ambiguous vision-language alignment, which weaken their grasp of navigation-friendly high-level scene priors and often lead to behaviors that violate the given commands. To tackle these issues, we propose a navigation policy that recursively summarizes along-the-way visual perceptions and adaptively aligns them with commands to strengthen linguistic grounding. In particular, by structurally modeling historical trajectories as compact neural grids, several Recursive Visual Imagination (RVI) techniques encourage agents to focus on the regularity of visual transitions and semantic scene layouts rather than on misleading geometric details. An Adaptive Linguistic Grounding (ALG) technique then purposefully aligns the learned situational memories with different linguistic components. This fine-grained semantic matching facilitates accurate anticipation of navigation actions and progress. Our navigation policy outperforms state-of-the-art methods on the challenging VLN-CE and ObjectNav tasks, demonstrating the effectiveness of our RVI and ALG techniques for VLN.
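The abstract does not give implementation details, but the two ideas can be sketched in a toy form: a gated recursive update that compresses a stream of observations into a fixed-size memory (standing in for RVI's compact neural grid), and a dot-product attention step that uses that memory to weight instruction tokens (standing in for ALG). All function names, shapes, and the sigmoid gating form below are illustrative assumptions, not the authors' actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def recursive_update(memory, obs, gate_w):
    """Recursively fold a new observation into a fixed-size memory.

    A sigmoid gate (hypothetical) decides, per dimension, how much of the
    incoming observation overwrites the running summary, so the memory
    tracks transition patterns instead of accumulating raw detail.
    """
    z = 1.0 / (1.0 + np.exp(-(obs @ gate_w)))  # update gate in (0, 1)
    return (1.0 - z) * memory + z * obs

def adaptive_grounding(memory, tokens):
    """Attend to instruction tokens using the summarized memory as query.

    Returns the grounded instruction vector and the attention weights,
    which indicate the semantically critical tokens for the current state.
    """
    scores = tokens @ memory / np.sqrt(memory.shape[-1])  # (T,)
    weights = softmax(scores)
    return weights @ tokens, weights

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, T = 8, 5                       # toy feature and token dimensions
    memory = np.zeros(d)
    gate_w = rng.standard_normal((d, d)) * 0.1
    tokens = rng.standard_normal((T, d))  # stand-in instruction embeddings
    for _ in range(3):                # three navigation steps
        memory = recursive_update(memory, rng.standard_normal(d), gate_w)
    grounded, weights = adaptive_grounding(memory, tokens)
    print(grounded.shape, weights.round(3))
```

The memory stays the same size no matter how long the trajectory grows, which is the point of the recursive summary; the attention weights form a distribution over tokens, so the policy can condition on the currently relevant instruction fragment.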