Cross from Left to Right Brain: Adaptive Text Dreamer for Vision-and-Language Navigation

📅 2025-05-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In vision-and-language navigation (VLN), partial observability hinders precise alignment between visual perception and linguistic instructions, while existing visual synthesis methods suffer from high computational overhead and excessive low-level detail. To address these challenges, we propose a novel paradigm, "language-form textual dreaming," implemented via a human-inspired, bi-hemispheric architecture: a left-brain (logical) branch for navigation reasoning and a right-brain (imaginative) branch for predicting semantically coherent textual dreams of future scenes, both realized by fine-tuning only a Q-Former on a shared LLM. We further introduce cross-hemispheric interaction regularization to synergistically integrate general-purpose LLM reasoning with domain-specific navigation knowledge. Evaluated on the R2R benchmark, our method achieves state-of-the-art performance with 42% fewer parameters and 3.1× faster inference, while significantly improving semantic imagination accuracy.
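As a rough illustration of the dual-branch design described in the summary, the PyTorch sketch below pairs two lightweight, trainable Q-Former branches with a frozen LLM backbone. All module names, dimensions, and the `nn.Identity` stand-in for the LLM are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class QFormerBranch(nn.Module):
    """A lightweight Q-Former: learned queries cross-attend to frozen features."""

    def __init__(self, dim: int = 768, num_queries: int = 32, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, num_patches, dim) frozen observation embeddings
        q = self.queries.expand(features.size(0), -1, -1)
        attended, _ = self.cross_attn(q, features, features)
        return self.norm(attended + self.ffn(attended))


class DualBrainSketch(nn.Module):
    """Two trainable Q-Former branches over one frozen LLM backbone:
    the left ("logical") branch integrates instruction and history,
    the right ("imaginative") branch predicts future scene semantics."""

    def __init__(self, llm: nn.Module, dim: int = 768):
        super().__init__()
        self.llm = llm
        for p in self.llm.parameters():  # freeze the backbone; only Q-Formers train
            p.requires_grad = False
        self.left_brain = QFormerBranch(dim)   # logical integration
        self.right_brain = QFormerBranch(dim)  # imaginative prediction

    def forward(self, obs_feats: torch.Tensor):
        logic_tokens = self.left_brain(obs_feats)
        dream_tokens = self.right_brain(obs_feats)
        # In the real system, the frozen LLM would decode dream_tokens into
        # textual "dreams"; prompt construction is omitted here.
        return logic_tokens, dream_tokens


if __name__ == "__main__":
    model = DualBrainSketch(llm=nn.Identity())
    logic, dream = model(torch.randn(2, 49, 768))
    print(logic.shape, dream.shape)  # torch.Size([2, 32, 768]) each
```

Keeping the backbone frozen and training only the query branches is what makes the "dynamic updates during navigation" cheap: each step re-runs two small cross-attention stacks, not the full LLM's parameters.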

📝 Abstract
Vision-and-Language Navigation (VLN) requires the agent to navigate by following natural instructions under partial observability, making it difficult to align perception with language. Recent methods mitigate this by imagining future scenes, yet they rely on vision-based synthesis, leading to high computational cost and redundant details. To this end, we propose to adaptively imagine key environmental semantics via *language* form, enabling a more reliable and efficient strategy. Specifically, we introduce a novel Adaptive Text Dreamer (ATD), a dual-branch self-guided imagination policy built upon a large language model (LLM). ATD is designed with a human-like left-right brain architecture, where the left brain focuses on logical integration, and the right brain is responsible for imaginative prediction of future scenes. To achieve this, we fine-tune only the Q-former within both brains to efficiently activate domain-specific knowledge in the LLM, enabling dynamic updates of logical reasoning and imagination during navigation. Furthermore, we introduce a cross-interaction mechanism to regularize the imagined outputs and inject them into a navigation expert module, allowing ATD to jointly exploit both the reasoning capacity of the LLM and the expertise of the navigation model. We conduct extensive experiments on the R2R benchmark, where ATD achieves state-of-the-art performance with fewer parameters. The code is available at https://github.com/zhangpingrui/Adaptive-Text-Dreamer.
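The cross-interaction mechanism mentioned in the abstract, which regularizes the imagined outputs and injects them into a navigation expert module, could plausibly take a form like the bidirectional cross-attention below. This is a minimal sketch under assumed shapes and interfaces; `CrossInteraction` and `NavigationExpertHead` are hypothetical names, not the paper's modules.

```python
import torch
import torch.nn as nn


class CrossInteraction(nn.Module):
    """Bidirectional cross-attention between logic and dream tokens."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.logic_to_dream = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.dream_to_logic = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, logic: torch.Tensor, dream: torch.Tensor) -> torch.Tensor:
        l, _ = self.logic_to_dream(logic, dream, dream)  # logic queries imagination
        d, _ = self.dream_to_logic(dream, logic, logic)  # imagination queries logic
        return self.norm(torch.cat([logic + l, dream + d], dim=1))


class NavigationExpertHead(nn.Module):
    """Scores candidate viewpoints from the fused tokens (hypothetical interface)."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.score = nn.Linear(dim, dim)

    def forward(self, fused: torch.Tensor, candidates: torch.Tensor) -> torch.Tensor:
        state = fused.mean(dim=1)  # (batch, dim) pooled navigation state
        # candidates: (batch, num_viewpoints, dim) -> one logit per viewpoint
        return torch.einsum("bd,bkd->bk", self.score(state), candidates)


if __name__ == "__main__":
    fuse, head = CrossInteraction(), NavigationExpertHead()
    fused = fuse(torch.randn(2, 32, 768), torch.randn(2, 32, 768))
    print(head(fused, torch.randn(2, 5, 768)).shape)  # torch.Size([2, 5]) logits
```

The point of fusing before scoring is that the expert head sees both what the agent has logically integrated so far and what it imagines lies ahead, matching the abstract's claim of jointly exploiting LLM reasoning and navigation expertise.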
Problem

Research questions and friction points this paper is trying to address.

Aligning perception with language under partial observability in Vision-and-Language Navigation (VLN)
High computational cost and redundant low-level detail of vision-based future-scene synthesis
Making imagination of key environmental semantics reliable and efficient during navigation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Text Dreamer (ATD): imagining key environmental semantics in language form for VLN
Dual-branch self-guided imagination policy built on an LLM, fine-tuning only the Q-Former in each branch (see the sketch after this list)
Cross-interaction mechanism that regularizes imagined outputs and injects them into a navigation expert module
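Since Q-Former-only fine-tuning is the key to the method's parameter efficiency, here is a small helper showing the standard way to freeze everything else and report the trainable-parameter budget. The `"q_former"` name filter is an assumption about how such a model labels its submodules, not the paper's code.

```python
import torch.nn as nn


def freeze_all_but_qformer(model: nn.Module, key: str = "q_former") -> tuple[int, int]:
    """Freeze every parameter whose name does not contain `key`.
    Returns (trainable, total) parameter counts."""
    trainable = total = 0
    for name, p in model.named_parameters():
        p.requires_grad = key in name
        total += p.numel()
        if p.requires_grad:
            trainable += p.numel()
    return trainable, total


# Usage (hypothetical model):
#   trainable, total = freeze_all_but_qformer(atd_model)
#   print(f"trainable: {trainable / 1e6:.1f}M of {total / 1e6:.1f}M")
```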
👥 Authors
Pingrui Zhang, Fudan University (robotics, embodied AI, computer vision)
Yifei Su, Institute of Automation, Chinese Academy of Sciences (embodied AI, multimodal learning)
Pengyuan Wu, Shanghai AI Laboratory
Dong An, Mohamed bin Zayed University of Artificial Intelligence
Li Zhang, University of Science and Technology of China
Zhigang Wang, Shanghai AI Laboratory
Dong Wang, Shanghai AI Laboratory
Yan Ding, Shanghai AI Laboratory
Bin Zhao, Shanghai AI Laboratory
Xuelong Li, TeleAI, China Telecom Corp Ltd