Cross from Left to Right Brain: Adaptive Text Dreamer for Vision-and-Language Navigation

📅 2025-05-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In vision-and-language navigation (VLN), partial observability hinders precise alignment between visual perception and linguistic instructions, while existing visual synthesis methods suffer from high computational overhead and excessive low-level detail. To address these challenges, we propose a novel paradigm, "language-form textual dreaming," implemented via a human-inspired, bi-hemispheric architecture: a left-brain (logical) branch for navigation reasoning and a right-brain (imaginative) branch for predicting semantically coherent textual dreams of future scenes, both realized by fine-tuning only a Q-Former on a shared LLM. We further introduce cross-hemispheric interaction regularization to synergistically integrate general-purpose LLM reasoning with domain-specific navigation knowledge. Evaluated on the R2R benchmark, our method achieves state-of-the-art performance with 42% fewer parameters and 3.1× faster inference, while significantly improving semantic imagination accuracy.
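As a rough illustration of the dual-branch design described in the summary, the PyTorch sketch below pairs two lightweight, trainable Q-Former branches with a frozen LLM backbone. All module names, dimensions, and the `nn.Identity` stand-in for the LLM are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class QFormerBranch(nn.Module):
    """A lightweight Q-Former: learned queries cross-attend to frozen features."""

    def __init__(self, dim: int = 768, num_queries: int = 32, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, num_patches, dim) frozen observation embeddings
        q = self.queries.expand(features.size(0), -1, -1)
        attended, _ = self.cross_attn(q, features, features)
        return self.norm(attended + self.ffn(attended))


class DualBrainSketch(nn.Module):
    """Two trainable Q-Former branches over one frozen LLM backbone:
    the left ("logical") branch integrates instruction and history,
    the right ("imaginative") branch predicts future scene semantics."""

    def __init__(self, llm: nn.Module, dim: int = 768):
        super().__init__()
        self.llm = llm
        for p in self.llm.parameters():  # freeze the backbone; only Q-Formers train
            p.requires_grad = False
        self.left_brain = QFormerBranch(dim)   # logical integration
        self.right_brain = QFormerBranch(dim)  # imaginative prediction

    def forward(self, obs_feats: torch.Tensor):
        logic_tokens = self.left_brain(obs_feats)
        dream_tokens = self.right_brain(obs_feats)
        # In the real system, the frozen LLM would decode dream_tokens into
        # textual "dreams"; prompt construction is omitted here.
        return logic_tokens, dream_tokens


if __name__ == "__main__":
    model = DualBrainSketch(llm=nn.Identity())
    logic, dream = model(torch.randn(2, 49, 768))
    print(logic.shape, dream.shape)  # torch.Size([2, 32, 768]) each
```

Keeping the backbone frozen and training only the query branches is what makes the "dynamic updates during navigation" cheap: each step re-runs two small cross-attention stacks, not the full LLM's parameters.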

📝 Abstract
Vision-and-Language Navigation (VLN) requires the agent to navigate by following natural instructions under partial observability, making it difficult to align perception with language. Recent methods mitigate this by imagining future scenes, yet they rely on vision-based synthesis, leading to high computational cost and redundant details. To this end, we propose to adaptively imagine key environmental semantics via *language* form, enabling a more reliable and efficient strategy. Specifically, we introduce a novel Adaptive Text Dreamer (ATD), a dual-branch self-guided imagination policy built upon a large language model (LLM). ATD is designed with a human-like left-right brain architecture, where the left brain focuses on logical integration, and the right brain is responsible for imaginative prediction of future scenes. To achieve this, we fine-tune only the Q-former within both brains to efficiently activate domain-specific knowledge in the LLM, enabling dynamic updates of logical reasoning and imagination during navigation. Furthermore, we introduce a cross-interaction mechanism to regularize the imagined outputs and inject them into a navigation expert module, allowing ATD to jointly exploit both the reasoning capacity of the LLM and the expertise of the navigation model. We conduct extensive experiments on the R2R benchmark, where ATD achieves state-of-the-art performance with fewer parameters. The code is available at https://github.com/zhangpingrui/Adaptive-Text-Dreamer.
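The cross-interaction mechanism mentioned in the abstract, which regularizes the imagined outputs and injects them into a navigation expert module, could plausibly take a form like the bidirectional cross-attention below. This is a minimal sketch under assumed shapes and interfaces; `CrossInteraction` and `NavigationExpertHead` are hypothetical names, not the paper's modules.

```python
import torch
import torch.nn as nn


class CrossInteraction(nn.Module):
    """Bidirectional cross-attention between logic and dream tokens."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.logic_to_dream = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.dream_to_logic = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, logic: torch.Tensor, dream: torch.Tensor) -> torch.Tensor:
        l, _ = self.logic_to_dream(logic, dream, dream)  # logic queries imagination
        d, _ = self.dream_to_logic(dream, logic, logic)  # imagination queries logic
        return self.norm(torch.cat([logic + l, dream + d], dim=1))


class NavigationExpertHead(nn.Module):
    """Scores candidate viewpoints from the fused tokens (hypothetical interface)."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.score = nn.Linear(dim, dim)

    def forward(self, fused: torch.Tensor, candidates: torch.Tensor) -> torch.Tensor:
        state = fused.mean(dim=1)  # (batch, dim) pooled navigation state
        # candidates: (batch, num_viewpoints, dim) -> one logit per viewpoint
        return torch.einsum("bd,bkd->bk", self.score(state), candidates)


if __name__ == "__main__":
    fuse, head = CrossInteraction(), NavigationExpertHead()
    fused = fuse(torch.randn(2, 32, 768), torch.randn(2, 32, 768))
    print(head(fused, torch.randn(2, 5, 768)).shape)  # torch.Size([2, 5]) logits
```

The point of fusing before scoring is that the expert head sees both what the agent has logically integrated so far and what it imagines lies ahead, matching the abstract's claim of jointly exploiting LLM reasoning and navigation expertise.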
Problem

Research questions and friction points this paper is trying to address.

Aligning perception with language under partial observability in Vision-and-Language Navigation (VLN)
High computational cost and redundant low-level detail of vision-based future-scene synthesis
Making imagination of key environmental semantics reliable and efficient during navigation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Text Dreamer (ATD): imagining key environmental semantics in language form for VLN
Dual-branch self-guided imagination policy built on an LLM, fine-tuning only the Q-Former in each branch (see the sketch after this list)
Cross-interaction mechanism that regularizes imagined outputs and injects them into a navigation expert module
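Since Q-Former-only fine-tuning is the key to the method's parameter efficiency, here is a small helper showing the standard way to freeze everything else and report the trainable-parameter budget. The `"q_former"` name filter is an assumption about how such a model labels its submodules, not the paper's code.

```python
import torch.nn as nn


def freeze_all_but_qformer(model: nn.Module, key: str = "q_former") -> tuple[int, int]:
    """Freeze every parameter whose name does not contain `key`.
    Returns (trainable, total) parameter counts."""
    trainable = total = 0
    for name, p in model.named_parameters():
        p.requires_grad = key in name
        total += p.numel()
        if p.requires_grad:
            trainable += p.numel()
    return trainable, total


# Usage (hypothetical model):
#   trainable, total = freeze_all_but_qformer(atd_model)
#   print(f"trainable: {trainable / 1e6:.1f}M of {total / 1e6:.1f}M")
```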
👥 Authors
Pingrui Zhang, Fudan University (robotics, embodied AI, computer vision)
Yifei Su, Institute of Automation, Chinese Academy of Sciences (embodied AI, multimodal learning)
Pengyuan Wu, Shanghai AI Laboratory
Dong An, Mohamed bin Zayed University of Artificial Intelligence
Li Zhang, University of Science and Technology of China
Zhigang Wang, Shanghai AI Laboratory
Dong Wang, Shanghai AI Laboratory
Yan Ding, Shanghai AI Laboratory
Bin Zhao, Shanghai AI Laboratory
Xuelong Li, TeleAI, China Telecom Corp Ltd