DAgger Diffusion Navigation: DAgger Boosted Diffusion Policy for Vision-Language Navigation

📅 2025-08-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the global sub-optimality and performance bottlenecks inherent in two-stage waypoint planning for Vision-and-Language Navigation in Continuous Environments (VLN-CE), this paper proposes the first end-to-end diffusion policy model that jointly models waypoint generation and path execution. It introduces conditional diffusion models into VLN, enabling direct learning of the multimodal (vision-language-action) joint distribution in continuous action space. The approach combines DAgger-based online expert demonstration collection, trajectory augmentation, and policy fine-tuning to improve robustness and behavioral diversity in long-horizon navigation. On the R2R and CVD benchmarks, the method significantly outperforms existing two-stage approaches, even without relying on an external waypoint predictor, demonstrating stronger instruction grounding and continuous control. This work establishes diffusion policies as a viable and powerful paradigm for end-to-end VLN-CE.
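The summary above says the policy samples continuous actions by denoising, conditioned on vision-language context. A minimal sketch of that idea, using a DDPM-style reverse process with a toy stand-in for the learned noise-prediction network (`eps_theta`, the schedule constants, and the context vector are all illustrative assumptions, not the paper's actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def eps_theta(a_t, t, context):
    # Hypothetical stand-in for the learned noise-prediction network;
    # in the paper the context would encode visual observations and
    # the language instruction. Here it is just a toy linear map.
    return 0.1 * a_t + 0.01 * context

T = 50                               # number of diffusion steps (assumed)
betas = np.linspace(1e-4, 0.02, T)   # linear noise schedule (assumed)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def sample_action(context, action_dim=2):
    """DDPM-style reverse process: Gaussian noise -> continuous action."""
    a = rng.standard_normal(action_dim)
    for t in reversed(range(T)):
        z = rng.standard_normal(action_dim) if t > 0 else 0.0
        eps = eps_theta(a, t, context)
        # Standard DDPM posterior mean update, then add scheduled noise.
        a = (a - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        a = a + np.sqrt(betas[t]) * z
    return a

action = sample_action(context=np.ones(2))
print(action.shape)  # (2,)
```

Because sampling starts from fresh noise each time, repeated calls can land in different modes of the action distribution, which is how a diffusion policy captures multiple plausible instruction-following behaviors.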

📝 Abstract
Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to follow natural language instructions through free-form 3D spaces. Existing VLN-CE approaches typically use a two-stage waypoint planning framework, where a high-level waypoint predictor generates the navigable waypoints, and then a navigation planner suggests the intermediate goals in the high-level action space. However, this two-stage decomposition framework suffers from: (1) global sub-optimization due to the proxy objective in each stage, and (2) a performance bottleneck caused by the strong reliance on the quality of the first-stage predicted waypoints. To address these limitations, we propose DAgger Diffusion Navigation (DifNav), an end-to-end optimized VLN-CE policy that unifies the traditional two stages, i.e. waypoint generation and planning, into a single diffusion policy. Notably, DifNav employs a conditional diffusion policy to directly model multi-modal action distributions over future actions in continuous navigation space, eliminating the need for a waypoint predictor while enabling the agent to capture multiple possible instruction-following behaviors. To address the issues of compounding error in imitation learning and enhance spatial reasoning in long-horizon navigation tasks, we employ DAgger for online policy training and expert trajectory augmentation, and use the aggregated data to further fine-tune the policy. This approach significantly improves the policy's robustness and its ability to recover from error states. Extensive experiments on benchmark datasets demonstrate that, even without a waypoint predictor, the proposed method substantially outperforms previous state-of-the-art two-stage waypoint-based models in terms of navigation performance. Our code is available at: https://github.com/Tokishx/DifNav.
Problem

Research questions and friction points this paper is trying to address.

Overcoming sub-optimization in two-stage VLN-CE navigation frameworks
Eliminating reliance on waypoint predictor performance bottlenecks
Improving robustness in long-horizon instruction-following navigation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end diffusion policy for VLN-CE
DAgger-enhanced online training for robustness
Multi-modal action modeling in continuous space
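The DAgger idea listed above can be sketched in a few lines: roll out the current learner, relabel the states it actually visits with expert actions, aggregate, and refit. The environment, expert, and tabular "policy" below are all toy assumptions standing in for the navigation simulator and the diffusion policy:

```python
import random

random.seed(0)

GOAL = 5  # toy 1-D corridor: state is a position, expert steps toward GOAL

def expert_action(state):
    return 1 if state < GOAL else 0

class Policy:
    """Hypothetical stand-in for the diffusion policy: a lookup table."""
    def __init__(self):
        self.table = {}
    def fit(self, data):
        for s, a in data:
            self.table[s] = a
    def act(self, state):
        # Unseen states get a random action, mimicking learner error.
        return self.table.get(state, random.choice([0, 1]))

def rollout(policy, steps=10):
    state, visited = 0, []
    for _ in range(steps):
        visited.append(state)
        state += policy.act(state)
    return visited

# DAgger loop: the expert relabels states from the *learner's* rollouts,
# so the aggregated dataset teaches recovery from the learner's own errors.
dataset, policy = [], Policy()
for _ in range(5):
    states = rollout(policy)
    dataset += [(s, expert_action(s)) for s in states]
    policy.fit(dataset)

print(all(policy.act(s) == expert_action(s) for s in range(GOAL)))  # True
```

The key contrast with plain imitation learning is the state distribution: data is collected under the learner's policy rather than the expert's, which is what addresses the compounding-error problem the abstract mentions.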