🤖 AI Summary
This work addresses long-horizon failures in vision-language navigation caused by state drift, specifically progress drift and memory drift, by introducing a dual-anchoring framework. The approach explicitly models instruction progress to separate completed from pending sub-goals, and adds a landmark-centric world model that retrospectively predicts object-centric embeddings of past observations, extracted with the Segment Anything Model, so that the agent keeps distinct representations of visited landmarks. To support training, the authors curate two large-scale datasets: 3.6 million samples with explicit progress descriptions and 937k grounded landmark samples. Experiments in both simulated and real-world environments report a 15.2% improvement in Success Rate and a 24.7% gain on long-horizon trajectories.
📝 Abstract
Vision-Language Navigation (VLN) requires an agent to navigate through 3D environments by following natural language instructions. While recent Video Large Language Models (Video-LLMs) have substantially advanced VLN, they remain highly susceptible to State Drift in long-horizon scenarios. In these cases, the agent's internal state drifts away from the true task execution state, leading to aimless wandering and failure to execute essential maneuvers in the instruction. We attribute this failure to two distinct cognitive deficits: Progress Drift, where the agent fails to distinguish completed sub-goals from remaining ones, and Memory Drift, where the agent's history representations degrade, causing it to lose track of visited landmarks. In this paper, we propose a Dual-Anchoring Framework that explicitly anchors both the instruction progress and the history representations. First, to address progress drift, we introduce Instruction Progress Anchoring, which supervises the agent to generate structured text tokens that delineate completed versus remaining sub-goals. Second, to mitigate memory drift, we propose Memory Landmark Anchoring, which uses a Landmark-Centric World Model to retrospectively predict object-centric embeddings extracted by the Segment Anything Model, compelling the agent to explicitly verify past observations and preserve distinct representations of visited landmarks. To support this framework, we curate two large-scale datasets: 3.6 million samples with explicit progress descriptions, and 937k grounded landmark samples for retrospective verification. Extensive experiments in both simulation and real-world environments demonstrate the superiority of our method, which achieves a 15.2% improvement in Success Rate and a 24.7% gain on long-horizon trajectories. To facilitate further research, we will release our code, data generation pipelines, and the collected datasets.
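As a rough illustration of what a structured progress target for Instruction Progress Anchoring might look like, the following Python sketch splits an instruction's sub-goals into completed and remaining parts and serializes them as text. The abstract does not specify the token format, so the `<completed>` and `<remaining>` tags, the `ProgressTarget` class, and the `build_progress_target` helper are all hypothetical names chosen for this example, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProgressTarget:
    """Hypothetical supervision target: which sub-goals are done, which remain."""
    completed: List[str]
    remaining: List[str]

    def to_text(self) -> str:
        # The tag names are an assumption; the paper's actual token format is not given.
        done = "; ".join(self.completed) or "none"
        todo = "; ".join(self.remaining) or "none"
        return f"<completed> {done} </completed> <remaining> {todo} </remaining>"


def build_progress_target(sub_goals: List[str], num_completed: int) -> ProgressTarget:
    """Split an ordered list of sub-goals at the agent's current progress index."""
    if not 0 <= num_completed <= len(sub_goals):
        raise ValueError("num_completed must be between 0 and len(sub_goals)")
    return ProgressTarget(
        completed=sub_goals[:num_completed],
        remaining=sub_goals[num_completed:],
    )


if __name__ == "__main__":
    # Example instruction, already decomposed into ordered sub-goals.
    goals = ["exit the bedroom", "turn left at the sofa", "stop by the fridge"]
    print(build_progress_target(goals, 1).to_text())
```

Under these assumptions, a training sample would pair the agent's observation history with such a string, so the model is supervised to state its progress explicitly rather than track it implicitly.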