Dual-Anchoring: Addressing State Drift in Vision-Language Navigation

📅 2026-04-19

🤖 AI Summary
This work addresses long-horizon vision-language navigation failures caused by state drift, specifically progress drift and memory drift, by introducing a dual-anchoring framework. The approach explicitly models instruction progress to separate completed from pending sub-goals, and adds a landmark-centric world model that retrospectively predicts object-centric embeddings of past observations, extracted with the Segment Anything Model, so that visited landmarks keep distinct representations. To support training, the authors curate two large-scale datasets: 3.6 million samples with explicit progress descriptions and 937k grounded landmark samples. Experiments in simulated and real-world environments report a 15.2% improvement in Success Rate and a 24.7% gain on long-horizon trajectories.

📝 Abstract
Vision-Language Navigation (VLN) requires an agent to navigate through 3D environments by following natural language instructions. While recent Video Large Language Models (Video-LLMs) have substantially advanced VLN, they remain highly susceptible to State Drift in long-horizon scenarios. In these cases, the agent's internal state drifts away from the true task execution state, leading to aimless wandering and failure to execute essential maneuvers in the instruction. We attribute this failure to two distinct cognitive deficits: Progress Drift, where the agent fails to distinguish completed sub-goals from remaining ones, and Memory Drift, where the agent's history representations degrade, causing it to lose track of visited landmarks. In this paper, we propose a Dual-Anchoring Framework that explicitly anchors the instruction progress and history representations. First, to address progress drift, we introduce Instruction Progress Anchoring, which supervises the agent to generate structured text tokens that delineate completed versus remaining sub-goals. Second, to mitigate memory drift, we propose Memory Landmark Anchoring, which uses a Landmark-Centric World Model to retrospectively predict object-centric embeddings extracted by the Segment Anything Model, compelling the agent to explicitly verify past observations and preserve distinct representations of visited landmarks. To support this framework, we curate two extensive datasets: 3.6 million samples with explicit progress descriptions, and 937k grounded landmark samples for retrospective verification. Extensive experiments in both simulation and real-world environments demonstrate the superiority of our method, which achieves a 15.2% improvement in Success Rate and a 24.7% gain on long-horizon trajectories. To facilitate further research, we will release our code, data generation pipelines, and the collected datasets.
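The two anchoring signals described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not specify the token format or the loss, so the `<completed>`/`<remaining>` tags, the function names, and the cosine-distance objective are all assumptions chosen for illustration.

```python
import math
from typing import Sequence


def progress_target(subgoals: Sequence[str], num_completed: int) -> str:
    """Build a structured text target that separates completed from remaining sub-goals.

    The <completed>/<remaining> tag names are hypothetical, not taken from the paper.
    """
    if not 0 <= num_completed <= len(subgoals):
        raise ValueError("num_completed must lie in [0, len(subgoals)]")
    done = "; ".join(subgoals[:num_completed]) or "none"
    todo = "; ".join(subgoals[num_completed:]) or "none"
    return f"<completed> {done} </completed> <remaining> {todo} </remaining>"


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def landmark_loss(
    predicted: Sequence[Sequence[float]],
    target: Sequence[Sequence[float]],
) -> float:
    """Mean cosine distance between predicted and reference landmark embeddings.

    `target` stands in for object-centric embeddings of past observations.
    The cosine-distance form is an assumed objective, not the paper's stated loss.
    """
    if len(predicted) != len(target):
        raise ValueError("predicted and target must contain the same number of landmarks")
    if not predicted:
        return 0.0
    distances = [1.0 - cosine_similarity(p, t) for p, t in zip(predicted, target)]
    return sum(distances) / len(distances)


if __name__ == "__main__":
    steps = ["exit the bedroom", "turn left at the sofa", "stop at the fridge"]
    print(progress_target(steps, 1))
    print(landmark_loss([[1.0, 0.0]], [[0.0, 1.0]]))
```

In this sketch the progress string would serve as a supervised text target at each step, while the landmark loss would penalize history representations that can no longer reproduce the embeddings of previously seen landmarks.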
Problem

Research questions and friction points this paper is trying to address.

State Drift
Vision-Language Navigation
Progress Drift
Memory Drift
Video-LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-Anchoring
State Drift
Vision-Language Navigation
Instruction Progress Anchoring
Memory Landmark Anchoring
Kangyi Wu
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Pengna Li
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Kailin Lyu
Institute of Automation, Chinese Academy of Sciences, School of Artificial Intelligence, University of Chinese Academy of Sciences
Lin Zhao
Beijing Institute of Technology; JD Explore Academy
Embodied AI, Robot Learning
Qingrong He
Joy Future Academy, JD
Jinjun Wang
Xi’an Jiaotong University, China
Jianyi Liu
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University