LatentPilot: Scene-Aware Vision-and-Language Navigation by Dreaming Ahead with Latent Visual Reasoning

📅 2026-03-30
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing vision-and-language navigation models neglect the future visual dynamics induced by actions, and therefore struggle to capture the causal relationship between actions and environmental changes, which weakens decision robustness. To address this limitation, this work proposes LatentPilot, which is the first to introduce an unsupervised, global, continuous latent variable for this setting. During training, the model leverages future observations to learn action-conditioned visual dynamics; at inference time, it enables cross-timestep "foresight" without requiring future inputs. The approach integrates a flywheel-style reinforcement learning scheme, an expert takeover mechanism, and global attention in a continuous latent space. LatentPilot achieves new state-of-the-art performance on the R2R-CE, RxR-CE, and R2R-PE benchmarks and demonstrates superior understanding of scene dynamics in real-world robotic experiments.
πŸ“ Abstract
Existing vision-and-language navigation (VLN) models primarily reason over past and current visual observations, while largely ignoring the future visual dynamics induced by actions. As a result, they often lack an effective understanding of the causal relationship between actions and how the visual world changes, limiting robust decision-making. Humans, in contrast, can imagine the near future by leveraging action-dynamics causality, which improves both environmental understanding and navigation choices. Inspired by this capability, we propose LatentPilot, a new paradigm that exploits future observations during training as a valuable data source for learning action-conditioned visual dynamics, while requiring no access to future frames at inference. Concretely, we propose a flywheel-style training mechanism that iteratively collects on-policy trajectories and retrains the model to better match the agent's behavior distribution, with an expert takeover triggered when the agent deviates excessively. LatentPilot further learns visual latent tokens without explicit supervision; these latent tokens attend globally in a continuous latent space and are carried across steps, serving as both the current output and the next input, thereby enabling the agent to dream ahead and reason about how actions will affect subsequent observations. Experiments on the R2R-CE, RxR-CE, and R2R-PE benchmarks yield new state-of-the-art results, and real-robot tests across diverse environments demonstrate LatentPilot's superior understanding of environment-action dynamics. Project page: https://abdd.top/latentpilot/
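The abstract describes latent tokens that attend globally in a continuous latent space and are carried across timesteps, serving as both the current step's output and the next step's input. A minimal PyTorch sketch of that recurrence follows; the dimensions, the 4-way action head, and all names are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class LatentCarrySketch(nn.Module):
    """Hypothetical sketch of cross-step latent visual reasoning:
    learned latent tokens attend over the current observation plus the
    previous step's latents, then are passed forward to the next step."""

    def __init__(self, dim: int = 256, num_latents: int = 8, num_actions: int = 4):
        super().__init__()
        # Initial latent tokens (assumed learned; no explicit supervision)
        self.init_latents = nn.Parameter(torch.zeros(1, num_latents, dim))
        # Global attention in the continuous latent space
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # Toy action head (the real action space is not specified here)
        self.policy = nn.Linear(dim, num_actions)

    def step(self, obs_tokens: torch.Tensor, prev_latents: torch.Tensor):
        # Latent tokens query the full context: observation + carried latents
        ctx = torch.cat([obs_tokens, prev_latents], dim=1)
        new_latents, _ = self.attn(prev_latents, ctx, ctx)
        # Decide an action from the updated latents
        action_logits = self.policy(new_latents.mean(dim=1))
        # new_latents are this step's output AND next step's input
        return action_logits, new_latents
```

A rollout would call `step` repeatedly, feeding each step's `new_latents` back in, which is what lets the model "dream ahead" about action-conditioned dynamics without seeing future frames at inference.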
Problem

Research questions and friction points this paper is trying to address.

vision-and-language navigation
future visual dynamics
action-conditioned visual dynamics
causal reasoning
scene understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-and-language navigation
latent visual reasoning
action-conditioned dynamics
flywheel training
future imagination