🤖 AI Summary
Existing vision-and-language navigation methods struggle to align natural language instructions with environmental states in complex scenes, in part because they lack commonsense reasoning, which leads to navigation errors. This paper proposes Landmark-Guided Knowledge (LGK), a knowledge-enhanced navigation framework. Its core contributions are: (1) a structured commonsense knowledge base of 630K descriptive entries, matched against environmental subviews to retrieve relevant knowledge; (2) a Knowledge-Guided by Landmark (KGL) mechanism that uses landmark cues in the instruction to focus the agent on the most relevant knowledge, mitigating the data bias introduced by external knowledge integration; and (3) a Knowledge-Guided Dynamic Augmentation (KGDA) module that fuses language, knowledge, vision, and historical state to refine decision-making. Evaluated on the R2R and REVERIE benchmarks, the method achieves significant improvements (+3.2% navigation success rate, +4.1% SPL (Success weighted by Path Length), and a 19.7% reduction in navigation error), demonstrating the critical role of explicit commonsense knowledge modeling in embodied navigation.
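The landmark-guidance idea can be pictured as attention in which landmark tokens from the instruction query the retrieved knowledge features, so that knowledge unrelated to the mentioned landmarks receives low weight. The sketch below is a generic scaled dot-product attention, not the paper's exact formulation; the shapes and names are illustrative assumptions.

```python
import numpy as np

def landmark_guided_attention(landmark_q, knowledge_kv):
    """Weight retrieved knowledge features by their relevance to
    instruction landmarks via scaled dot-product attention.

    landmark_q:   (L, d) embeddings of landmark tokens (assumed inputs)
    knowledge_kv: (K, d) embeddings of retrieved knowledge entries
    Returns:      (L, d) landmark-conditioned knowledge features
    """
    d = landmark_q.shape[-1]
    scores = landmark_q @ knowledge_kv.T / np.sqrt(d)          # (L, K)
    # numerically stable softmax over the knowledge axis
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ knowledge_kv
```

Each landmark thus attends over all retrieved knowledge, and the weighted sum downplays entries the landmarks do not support, which is one plausible way external knowledge can be filtered before fusion.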
📝 Abstract
Vision-and-language navigation is a core task in embodied intelligence, requiring an agent to navigate autonomously through an unfamiliar environment by following natural language instructions. However, existing methods often fail to match instructions with environmental information in complex scenes, partly because they lack commonsense reasoning ability. This paper proposes a vision-and-language navigation method called Landmark-Guided Knowledge (LGK), which introduces an external knowledge base to assist navigation and addresses the misjudgments caused by insufficient common sense in traditional methods. Specifically, we first construct a knowledge base containing 630,000 language descriptions and use knowledge matching to align environmental subviews with the knowledge base, extracting relevant descriptive knowledge. Next, we design a Knowledge-Guided by Landmark (KGL) mechanism that leverages landmark information in the instructions to guide the agent toward the most relevant parts of the knowledge, thereby reducing the data bias that incorporating external knowledge may introduce. Finally, we propose Knowledge-Guided Dynamic Augmentation (KGDA), which effectively integrates language, knowledge, vision, and historical information. Experimental results demonstrate that LGK outperforms existing state-of-the-art methods on the R2R and REVERIE vision-and-language navigation datasets, particularly in terms of navigation error, success rate, and path efficiency.
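The knowledge-matching step described above (aligning environmental subviews with knowledge-base entries to extract relevant descriptions) can be sketched as nearest-neighbor retrieval in a shared embedding space. This is a minimal illustration under the assumption that subviews and knowledge descriptions have already been embedded by some encoder; the function name and top-k retrieval are assumptions, not the paper's implementation.

```python
import numpy as np

def match_knowledge(view_embs, knowledge_embs, top_k=2):
    """Retrieve the top-k knowledge entries for each environmental
    subview by cosine similarity.

    view_embs:      (V, d) subview embeddings (assumed precomputed)
    knowledge_embs: (K, d) knowledge-description embeddings
    Returns:        (V, top_k) indices of the best-matching entries
    """
    v = view_embs / np.linalg.norm(view_embs, axis=1, keepdims=True)
    k = knowledge_embs / np.linalg.norm(knowledge_embs, axis=1, keepdims=True)
    sims = v @ k.T                          # (V, K) cosine similarities
    # sort descending and keep the top-k entry indices per subview
    return np.argsort(-sims, axis=1)[:, :top_k]
```

In the actual system the 630,000-entry base would be indexed with an approximate nearest-neighbor structure rather than a dense matrix product, but the retrieval logic is the same.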