GoViG: Goal-Conditioned Visual Navigation Instruction Generation

📅 2025-08-13
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This paper addresses the problem of generating accurate, contextually coherent navigation instructions solely from first-person initial and goal images, without requiring semantic annotations or structured environmental priors such as maps. To this end, we propose a joint vision-prediction and instruction-generation framework that incorporates both single-step and interleaved multimodal reasoning mechanisms to emulate human-like incremental spatial reasoning. Our approach is built upon an autoregressive multimodal large language model, integrating visual state prediction, cross-modal alignment training, and end-to-end instruction synthesis. Evaluated on our newly constructed R2R-Goal dataset, the method achieves significant improvements in BLEU-4 and CIDEr scores over prior state-of-the-art approaches and demonstrates strong cross-domain generalization.

๐Ÿ“ Abstract
We introduce Goal-Conditioned Visual Navigation Instruction Generation (GoViG), a new task that aims to autonomously generate precise and contextually coherent navigation instructions solely from egocentric visual observations of initial and goal states. Unlike conventional approaches that rely on structured inputs such as semantic annotations or environmental maps, GoViG exclusively leverages raw egocentric visual data, substantially improving its adaptability to unseen and unstructured environments. Our method addresses this task by decomposing it into two interconnected subtasks: (1) visual forecasting, which predicts intermediate visual states bridging the initial and goal views; and (2) instruction generation, which synthesizes linguistically coherent instructions grounded in both observed and anticipated visuals. These subtasks are integrated within an autoregressive multimodal large language model trained with tailored objectives to ensure spatial accuracy and linguistic clarity. Furthermore, we introduce two complementary multimodal reasoning strategies, one-pass and interleaved reasoning, to mimic incremental human cognitive processes during navigation. To evaluate our method, we propose the R2R-Goal dataset, combining diverse synthetic and real-world trajectories. Empirical results demonstrate significant improvements over state-of-the-art methods, achieving superior BLEU-4 and CIDEr scores along with robust cross-domain generalization.
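The abstract's two reasoning strategies can be illustrated with a minimal sketch. This is not the paper's implementation: all function names (`one_pass_reasoning`, `interleaved_reasoning`, the toy stand-ins) are hypothetical, and plain strings stand in for egocentric images and model calls. It only shows the control flow implied by the text: one-pass forecasts all intermediate visual states first and then generates the full instruction, while interleaved alternates between predicting the next view and emitting an instruction segment.

```python
# Hedged sketch of the two multimodal reasoning strategies described in GoViG.
# Every name here is a hypothetical illustration; strings stand in for
# egocentric views, and simple callables stand in for the multimodal LLM.
from typing import Callable, List, Tuple


def one_pass_reasoning(
    initial: str,
    goal: str,
    forecast: Callable[[str, str], List[str]],
    describe: Callable[[List[str]], str],
) -> str:
    """Predict all intermediate views first, then generate one full instruction."""
    views = [initial] + forecast(initial, goal) + [goal]
    return describe(views)


def interleaved_reasoning(
    initial: str,
    goal: str,
    step: Callable[[str, str], Tuple[str, str]],
    max_steps: int = 8,
) -> List[str]:
    """Alternate: forecast the next view, then emit an instruction segment."""
    current, segments = initial, []
    for _ in range(max_steps):
        current, segment = step(current, goal)
        segments.append(segment)
        if current == goal:  # reached the goal view; stop forecasting
            break
    return segments


# Toy stand-ins so the sketch runs end to end (not real model components).
def toy_forecast(initial: str, goal: str) -> List[str]:
    return ["hallway", "stairs"]


def toy_describe(views: List[str]) -> str:
    return " then ".join(f"reach {v}" for v in views[1:])


def toy_step(current: str, goal: str) -> Tuple[str, str]:
    nxt = {"start": "hallway", "hallway": "stairs", "stairs": goal}[current]
    return nxt, f"go to {nxt}"


print(one_pass_reasoning("start", "office", toy_forecast, toy_describe))
# prints "reach hallway then reach stairs then reach office"
print(interleaved_reasoning("start", "office", toy_step))
# prints ['go to hallway', 'go to stairs', 'go to office']
```

In the actual system both strategies would run inside a single autoregressive model over interleaved image and text tokens; the sketch separates them into callables only to make the two control flows explicit.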
Problem

Research questions and friction points this paper is trying to address.

Generates navigation instructions from visual observations
Uses raw visual data for adaptability in unseen environments
Integrates visual forecasting and instruction generation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages raw egocentric visual data only
Decomposes task into visual forecasting and instruction generation
Integrates autoregressive multimodal large language model