\textsc{NaVIDA}: Vision-Language Navigation with Inverse Dynamics Augmentation

📅 2026-01-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the instability, poor generalization, and error accumulation in existing vision-and-language navigation (VLN) methods, which stem from a lack of explicit modeling of the causal relationship between actions and visual changes. To this end, we propose NaVIDA, an inverse dynamics–enhanced framework that, for the first time in VLN, explicitly models the visual-action causality. NaVIDA integrates inverse dynamics supervision, a hierarchical probabilistic action chunking (HPAC) mechanism, and an entropy-guided adaptive execution strategy to enable structured long-horizon planning and robust execution. Despite using significantly fewer parameters (3B vs. 8B), our method surpasses current state-of-the-art models in performance and demonstrates effective deployment on real-world robotic platforms.

📝 Abstract
Vision-and-Language Navigation (VLN) requires agents to interpret natural language instructions and act coherently in visually rich environments. However, most existing methods rely on reactive state-action mappings without explicitly modeling how actions causally transform subsequent visual observations. Lacking such vision-action causality, agents cannot anticipate the visual changes induced by their own actions, leading to unstable behaviors, weak generalization, and cumulative error along the trajectory. To address these issues, we introduce \textsc{NaVIDA} (\textbf{Nav}igation with \textbf{I}nverse \textbf{D}ynamics \textbf{A}ugmentation), a unified VLN framework that couples policy learning with action-grounded visual dynamics and adaptive execution. \textsc{NaVIDA} augments training with chunk-based inverse-dynamics supervision to learn the causal relationship between visual changes and the corresponding actions. To structure this supervision and extend the effective planning range, \textsc{NaVIDA} employs hierarchical probabilistic action chunking (HPAC), which organizes trajectories into multi-step chunks and provides discriminative, longer-range visual-change cues. To further curb error accumulation and stabilize behavior at inference, an entropy-guided mechanism adaptively sets the execution horizon of action chunks. Extensive experiments show that \textsc{NaVIDA} achieves superior navigation performance compared to state-of-the-art methods with fewer parameters (3B vs. 8B). Real-world robot evaluations further validate the practical feasibility and effectiveness of our approach. Code and data will be available upon acceptance.
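The entropy-guided execution idea from the abstract can be sketched as follows: given a predicted action chunk with a per-step probability distribution over actions, execute leading steps while the policy is confident (low entropy) and replan once uncertainty rises. This is a minimal illustration only; the function names, threshold value, and stopping rule are assumptions, not the paper's actual implementation.

```python
import math

def step_entropy(probs):
    """Shannon entropy (in nats) of one step's action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def adaptive_horizon(chunk_probs, threshold=0.5):
    """Choose how many leading steps of a predicted action chunk to execute.

    chunk_probs: list of per-step action distributions for one chunk.
    Steps are executed while their entropy stays below `threshold`
    (a hypothetical cutoff); the agent then replans. At least one
    step is always executed so the agent cannot stall.
    """
    horizon = 0
    for probs in chunk_probs:
        if step_entropy(probs) > threshold:
            break
        horizon += 1
    return max(horizon, 1)

# Confident first step, uncertain second step -> execute only one step.
chunk = [[0.9, 0.05, 0.05],   # low entropy (~0.39 nats)
         [0.34, 0.33, 0.33]]  # near-uniform, high entropy (~1.10 nats)
```

Here a confident chunk is executed in full, while a chunk that becomes uncertain mid-way triggers early replanning, which is one plausible way to curb the error accumulation the abstract describes.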
Problem

Research questions and friction points this paper is trying to address.

Vision-and-Language Navigation
action-grounded visual dynamics
causal relationship
error accumulation
visual-action causality
Innovation

Methods, ideas, or system contributions that make the work stand out.

inverse dynamics
vision-language navigation
action chunking
causal modeling
entropy-guided execution