🤖 AI Summary
To address the temporal inconsistency, perception-action misalignment, and computational redundancy caused by fixed-step reasoning in Vision-and-Language Navigation (VLN), this paper proposes AdaNav, an uncertainty-driven adaptive reasoning framework. Methodologically, it introduces the Uncertainty Adaptive Reasoning Block (UAR), a lightweight plugin that quantifies policy uncertainty via action entropy; designs a dynamic reasoning-triggering mechanism for difficulty-aware sparse decision-making; and adopts a progressive Heuristics-to-RL training paradigm that combines heuristic path simulation with reinforcement-learning fine-tuning. With only 6K training samples, AdaNav improves success rate by 20% on R2R val-unseen, 11.7% on RxR-CE, and 11.4% in real-world scenes, significantly outperforming closed-source models trained on million-scale data. The framework demonstrates that uncertainty-guided adaptive reasoning enables efficient and robust VLN without massive annotated data.
📝 Abstract
Vision-and-Language Navigation (VLN) requires agents to follow natural language instructions by grounding them in sequential visual observations over long horizons. Explicit reasoning could enhance temporal consistency and perception-action alignment, but reasoning at fixed steps often leads to suboptimal performance and unnecessary computation. To address this, we propose AdaNav, an uncertainty-based adaptive reasoning framework for VLN. At its core is the Uncertainty Adaptive Reasoning Block (UAR), a lightweight plugin that dynamically triggers reasoning. We introduce Action Entropy as a policy prior for UAR and progressively refine it through a Heuristics-to-RL training method, enabling agents to learn difficulty-aware reasoning policies under the strict data limitations of embodied tasks. Results show that with only 6K training samples, AdaNav achieves substantial gains over closed-source models trained on million-scale data, improving success rate by 20% on R2R val-unseen, 11.7% on RxR-CE, and 11.4% in real-world scenes. The code is available at https://github.com/xinding-sys/AdaNav.
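The core mechanism, triggering explicit reasoning only when the policy's action entropy is high, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`action_entropy`, `should_reason`) and the fixed threshold are hypothetical, and AdaNav learns its triggering policy via the Heuristics-to-RL procedure rather than using a hand-set cutoff.

```python
import math

def action_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_reason(probs, threshold=1.0):
    """Hypothetical UAR-style trigger: invoke explicit reasoning only
    when the policy is uncertain, i.e. action entropy exceeds a threshold."""
    return action_entropy(probs) > threshold

# Confident step (one action dominates): low entropy, skip reasoning.
print(should_reason([0.9, 0.05, 0.03, 0.02]))   # False
# Ambiguous step (near-uniform over 4 actions): entropy ~= ln(4) ~ 1.39, reason.
print(should_reason([0.25, 0.25, 0.25, 0.25]))  # True
```

Because reasoning is invoked sparsely, easy steps cost only a cheap entropy check, which is where the claimed savings over fixed-step reasoning would come from.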