NavQ: Learning a Q-Model for Foresighted Vision-and-Language Navigation

๐Ÿ“… 2025-10-18
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Existing goal-oriented vision-language navigation (VLN) methods predominantly rely on historical observations and lack explicit modeling of the long-term consequences of actions, resulting in limited foresight. To address this, we propose the first Q-learning–based anticipatory VLN framework. Our method employs self-supervised learning on large-scale unlabeled trajectory data to train a Q-model that predicts the multimodal future observations induced by candidate actions. We design a cross-modal future encoder that fuses historical states with these future Q-features to produce action-value scores, and we integrate an A*-style search mechanism to enhance long-horizon planning. By explicitly modeling indoor scene layouts and object relationships, the approach significantly improves navigation performance on long trajectories and in complex environments, achieving state-of-the-art results on standard benchmarks including R2R and CVDN.

๐Ÿ“ Abstract
In this work we concentrate on the task of goal-oriented Vision-and-Language Navigation (VLN). Existing methods often make decisions based on historical information, overlooking the future implications and long-term outcomes of the actions. In contrast, we aim to develop a foresighted agent. Specifically, we draw upon Q-learning to train a Q-model using large-scale unlabeled trajectory data, in order to learn general knowledge regarding the layout and object relations within indoor scenes. This model can generate a Q-feature, analogous to the Q-value in a traditional Q-network, for each candidate action, which describes the potential future information that may be observed after taking that action. Subsequently, a cross-modal future encoder integrates the task-agnostic Q-feature with navigation instructions to produce a set of action scores reflecting future prospects. These scores, when combined with the original scores based on history, facilitate an A*-style search strategy to effectively explore the regions that are more likely to lead to the destination. Extensive experiments conducted on widely used goal-oriented VLN datasets validate the effectiveness of the proposed method.
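The scoring pipeline the abstract describes (a pretrained Q-model emits a task-agnostic Q-feature per candidate action, which a cross-modal encoder fuses with the instruction to yield a future-aware action score) can be sketched roughly as below. This is a minimal illustration, not the paper's implementation: `q_model` stands in for the pretrained Q-network, `cross_modal_score` replaces the learned future encoder with a simple dot product, and all names and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # hypothetical feature dimension

def q_model(observation, action):
    # Stand-in for the pretrained Q-model: maps a (history, candidate
    # action) pair to a task-agnostic Q-feature summarizing the future
    # observations expected after taking that action. Random here.
    return rng.standard_normal(DIM)

def cross_modal_score(q_feature, instruction_emb):
    # Stand-in for the cross-modal future encoder: fuses the Q-feature
    # with the instruction embedding into a scalar future-prospect score.
    return float(q_feature @ instruction_emb)

# Score each candidate action at the current viewpoint.
instruction_emb = rng.standard_normal(DIM)
candidates = ["forward", "turn_left", "turn_right"]
future_scores = {
    a: cross_modal_score(q_model(None, a), instruction_emb)
    for a in candidates
}
```

In the actual method these future scores are combined with the agent's history-based scores before action selection, rather than used alone.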
Problem

Research questions and friction points this paper is trying to address.

Addresses foresighted decision-making in vision-language navigation
Overcomes historical bias by modeling future action consequences
Integrates Q-learning with cross-modal reasoning for navigation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Trains Q-model using unlabeled trajectory data
Integrates Q-feature with navigation instructions
Combines future scores with history for A* search
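The last bullet, combining history-based scores with Q-derived future scores in an A*-style search, can be sketched as a standard best-first search over the navigation graph where the accumulated history score plays the role of g and the foresight score plays the role of h. This is a generic A* skeleton under assumed interfaces (`neighbors`, `history_score`, `future_score` are hypothetical callables), not the paper's exact searching strategy.

```python
import heapq

def a_star_navigate(start, goal, neighbors, history_score, future_score):
    # Frontier entries are (g + h, g, node, path): g accumulates the
    # history-based cost along the path, h is the foresight score for
    # the node (in the paper, derived from Q-features).
    frontier = [(future_score(start), 0.0, start, [start])]
    visited = set()
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for nxt in neighbors(node):
            g_next = g + history_score(node, nxt)
            heapq.heappush(
                frontier,
                (g_next + future_score(nxt), g_next, nxt, path + [nxt]),
            )
    return None  # goal unreachable

# Toy navigation graph: every edge costs 1, zero foresight heuristic.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
path = a_star_navigate(
    "A", "D",
    neighbors=graph.__getitem__,
    history_score=lambda u, v: 1.0,
    future_score=lambda n: 0.0,
)
```

With a zero heuristic this degenerates to uniform-cost search; the paper's contribution is precisely a learned, informative `future_score`.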
๐Ÿ”Ž Similar Papers
No similar papers found.
Peiran Xu (Peking University, Beijing, China)
Xicheng Gong (Peking University, Beijing, China)
Yadong Mu (Peking University)

Computer Vision · Robotics · Machine Learning