Language-Conditioned World Modeling for Visual Navigation

📅 2026-03-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the problem of embodied navigation using only first-person visual observations and natural language instructions, without relying on goal images. To this end, the authors introduce the first large-scale Language-Conditioned Visual Navigation (LCVN) benchmark dataset, comprising 39,016 trajectories paired with 117,048 human-verified instructions, and formulate the task as language-guided open-loop trajectory prediction. Two unified frameworks are proposed: LCVN-WM/AC, which integrates a diffusion-based world model with a latent-space actor-critic policy, and LCVN-Uni, an autoregressive multimodal architecture that jointly predicts actions and future observations. Experiments demonstrate that LCVN-WM/AC produces temporally smoother trajectories, while LCVN-Uni exhibits stronger generalization in unseen environments, collectively advancing the development of language-conditioned world models.
📝 Abstract
We study language-conditioned visual navigation (LCVN), in which an embodied agent is asked to follow a natural language instruction based only on an initial egocentric observation. Without access to goal images, the agent must rely on language to shape its perception and continuous control, making the grounding problem particularly challenging. We formulate this problem as open-loop trajectory prediction conditioned on linguistic instructions and introduce the LCVN Dataset, a benchmark of 39,016 trajectories and 117,048 human-verified instructions that supports reproducible research across a range of environments and instruction styles. Using this dataset, we develop LCVN frameworks that link language grounding, future-state prediction, and action generation through two complementary model families. The first family combines LCVN-WM, a diffusion-based world model, with LCVN-AC, an actor-critic agent trained in the latent space of the world model. The second family, LCVN-Uni, adopts an autoregressive multimodal architecture that predicts both actions and future observations. Experiments show that these families offer different advantages: the former provides more temporally coherent rollouts, whereas the latter generalizes better to unseen environments. Taken together, these observations point to the value of jointly studying language grounding, imagination, and policy learning in a unified task setting, and LCVN provides a concrete basis for further investigation of language-conditioned world models. The code is available at https://github.com/F1y1113/LCVN.
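The abstract frames LCVN as open-loop trajectory prediction: from a single initial egocentric image and an instruction, the agent must emit a whole action sequence without further feedback. A minimal sketch of that interface, with stand-in encoders and a random linear decoder (none of this is the paper's model; `predict_trajectory` and all its internals are illustrative assumptions):

```python
import numpy as np

def predict_trajectory(image, instruction, horizon=8, action_dim=2, seed=0):
    """Toy open-loop predictor: fuse instruction and image features, then
    decode a fixed-horizon action sequence in one shot (no replanning).
    Every component here is an illustrative stand-in, not LCVN's model."""
    rng = np.random.default_rng(seed)
    # Stand-in "encoders": hashed bag-of-words and mean-pixel features.
    text_feat = np.array([hash(w) % 101 for w in instruction.lower().split()],
                         dtype=np.float64).mean(keepdims=True)
    img_feat = image.reshape(-1).astype(np.float64).mean(keepdims=True)
    fused = np.concatenate([text_feat, img_feat])            # shape (2,)
    # Stand-in "decoder": a random linear map, squashed to a bounded range.
    W = rng.standard_normal((horizon * action_dim, fused.size))
    return np.tanh(W @ fused).reshape(horizon, action_dim)

# Example: a 64x64 RGB observation plus a language instruction.
obs = np.zeros((64, 64, 3), dtype=np.uint8)
traj = predict_trajectory(obs, "turn left at the red door", horizon=8)
print(traj.shape)  # (8, 2): one continuous action per step
```

The key property the sketch mirrors is the open-loop contract: the full trajectory is committed up front from language plus one observation, which is what makes grounding the instruction in perception the central difficulty.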
Problem

Research questions and friction points this paper aims to address.

language-conditioned visual navigation
embodied agent
natural language instruction
world modeling
language grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

language-conditioned navigation
world modeling
diffusion-based world model
autoregressive multimodal architecture
open-loop trajectory prediction
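The LCVN-WM/AC family pairs a world model with an actor-critic trained in the model's latent space: the policy acts on imagined latents and the critic scores them, so learning can proceed without environment calls. A toy rollout of that general recipe, with random weights standing in for the learned dynamics, actor, and critic (all names and shapes here are assumptions for illustration, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT, ACTION = 4, 2

# Stand-in learned components (random weights), mirroring the generic
# "imagine in latent space, act on imagined states" recipe.
W_dyn = rng.standard_normal((LATENT, LATENT + ACTION)) * 0.1  # world model
W_actor = rng.standard_normal((ACTION, LATENT)) * 0.1         # policy head
w_critic = rng.standard_normal(LATENT) * 0.1                  # value head

def imagine(z0, horizon=5):
    """Roll the world model forward from latent z0, letting the actor pick
    each action and the critic score each imagined state."""
    z, states, values = z0, [], []
    for _ in range(horizon):
        a = np.tanh(W_actor @ z)                      # actor acts on latent
        z = np.tanh(W_dyn @ np.concatenate([z, a]))   # imagined next latent
        states.append(z)
        values.append(float(w_critic @ z))            # critic's estimate
    return np.stack(states), values

z0 = rng.standard_normal(LATENT)
states, values = imagine(z0, horizon=5)
print(states.shape, len(values))  # (5, 4) 5
```

In the paper's setup the dynamics model is diffusion-based and the rollouts are image-conditioned on language; this sketch only shows why latent imagination yields temporally coherent trajectories, since each imagined state is produced by the same smooth dynamics map.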