AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limitations of existing vision-and-language navigation methods, which struggle to explicitly model interpretable relationships among agent states, instructions, and environments, and often rely on 3D sensors that constrain pretraining scalability. To overcome these challenges, the authors propose an end-to-end, self-aware reasoning framework that operates without additional 3D perception. The approach integrates a structural reasoning module to enhance spatial and task-oriented self-state understanding and employs an automated data engine that partitions training samples based on task progress to improve learning efficiency. Evaluated on multiple benchmarks in the Habitat simulator, the method is reported to significantly outperform prior state-of-the-art approaches while keeping its reasoning interpretable.

📝 Abstract

Vision-and-Language Navigation (VLN) requires an agent to ground language instructions to its own movement within a visual environment. While state-of-the-art methods leverage the reasoning capabilities of Vision-Language Models (VLMs) for end-to-end action prediction, they often lack an explicit and explainable understanding of the relationships between the agent, the instruction, and the scene. Conversely, explicitly building a scene map for heuristic planning is intuitively appealing but relies on additional 3D sensors and hinders large-scale vision-language pre-training. To bridge this gap, we propose AwareVLN, a novel framework that equips the navigation model with a self-aware reasoning mechanism, enabling it to understand the agent's state and task progress in a fully end-to-end and data-driven manner. Our approach features two key innovations: (1) a structural reasoning module that fosters spatial and task-oriented self-awareness, and (2) an automatic data engine with progress division for effective training. Extensive experiments on various datasets in the Habitat simulator show that AwareVLN significantly outperforms previous state-of-the-art vision-language navigation methods. Project page: https://gwxuan.github.io/AwareVLN/.
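The abstract does not spell out how the data engine performs "progress division", so the following is only an illustrative sketch of one way such a partition could look, not the authors' implementation. It assumes that each step of a recorded trajectory is assigned to a coarse stage by the fraction of the path already completed; the names `ProgressSample`, `divide_by_progress`, and the three stage labels are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ProgressSample:
    """One training sample tagged with how far along the episode it is (hypothetical type)."""
    step: int         # index of the action within the trajectory
    progress: float   # fraction of the trajectory completed before this step
    stage: str        # coarse progress label used to partition the data


def divide_by_progress(
    num_steps: int,
    stage_names: Sequence[str] = ("early", "middle", "late"),  # assumed labels
) -> List[ProgressSample]:
    """Assign every step of a trajectory to an equal-width progress stage."""
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    n_stages = len(stage_names)
    samples = []
    for step in range(num_steps):
        # Integer arithmetic avoids floating-point edge cases at stage boundaries.
        stage_idx = min(step * n_stages // num_steps, n_stages - 1)
        samples.append(
            ProgressSample(
                step=step,
                progress=step / num_steps,
                stage=stage_names[stage_idx],
            )
        )
    return samples


if __name__ == "__main__":
    for sample in divide_by_progress(6):
        print(sample)
```

Equal-width stages are the simplest possible choice here; a real engine might instead split on instruction sub-goals or landmarks, which the abstract does not specify.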

Problem

Research questions and friction points this paper is trying to address.

Vision-Language Navigation

Self-awareness

Scene Understanding

End-to-end Learning

Agent State Reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

self-aware reasoning

structural reasoning module

vision-language navigation