Beyond Textual Knowledge: Leveraging Multimodal Knowledge Bases for Enhancing Vision-and-Language Navigation

📅 2026-03-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses limitations of existing vision-and-language navigation methods in semantic understanding and cross-modal alignment. The authors propose a multimodal knowledge-enhanced framework that synergistically leverages large-model-generated image knowledge bases (R2R-GP and REVERIE-GP) together with a textual knowledge base constructed from panoramic views. Specifically, Qwen3-4B extracts target phrases, Flux-Schnell generates image-based knowledge, and BLIP-2 constructs the textual knowledge. To integrate these modalities effectively, the framework introduces a Goal-Aware Augmentor and a Knowledge Augmentor for goal-conditioned knowledge fusion. Evaluated on the R2R and REVERIE benchmarks, the method significantly outperforms baseline approaches, achieving absolute improvements of 5.0% and 2.07% in Success Rate (SR) and 4.0% and 3.69% in Success weighted by Path Length (SPL) in unseen environments, respectively.
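As a rough illustration of how such knowledge bases could be assembled, the sketch below chains the three named components. The model IDs match the components cited in the summary, but every prompt, function name, and generation setting is an assumption for illustration, not the authors' released pipeline.

```python
# Minimal sketch of the knowledge-base construction pipeline described above.
# Prompts, function names, and generation settings are assumptions.
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Blip2ForConditionalGeneration, Blip2Processor)
from diffusers import FluxPipeline


def extract_goal_phrases(instruction: str) -> str:
    """Ask Qwen3-4B for the goal-related phrases in a navigation instruction."""
    tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")
    llm = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen3-4B", torch_dtype="auto", device_map="auto")
    messages = [{"role": "user",
                 "content": f"List the target objects and locations in: {instruction}"}]
    prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tok(prompt, return_tensors="pt").to(llm.device)
    out = llm.generate(**inputs, max_new_tokens=64)
    return tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)


def generate_goal_image(phrase: str):
    """Render one image-knowledge entry with Flux-Schnell (4-step distilled model)."""
    pipe = FluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16).to("cuda")
    return pipe(prompt=phrase, num_inference_steps=4, guidance_scale=0.0).images[0]


def caption_panorama(image) -> str:
    """Build one textual-knowledge entry by captioning a panoramic view with BLIP-2."""
    proc = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
    blip = Blip2ForConditionalGeneration.from_pretrained(
        "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16).to("cuda")
    inputs = proc(images=image, return_tensors="pt").to("cuda", torch.float16)
    out = blip.generate(**inputs, max_new_tokens=30)
    return proc.batch_decode(out, skip_special_tokens=True)[0]


phrases = extract_goal_phrases("Walk past the sofa and stop at the fireplace.")
image = generate_goal_image(phrases)          # one R2R-GP-style entry
text_knowledge = caption_panorama(image)      # one panoramic-caption entry
```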
📝 Abstract
Vision-and-Language Navigation (VLN) requires an agent to navigate complex unseen environments following natural language instructions. However, existing methods often struggle to capture key semantic cues and align them accurately with visual observations. To address this limitation, we propose Beyond Textual Knowledge (BTK), a VLN framework that synergistically integrates environment-specific textual knowledge with generative image knowledge bases. BTK employs Qwen3-4B to extract goal-related phrases and uses Flux-Schnell to construct two large-scale image knowledge bases, R2R-GP and REVERIE-GP. Additionally, we leverage BLIP-2 to build a large-scale textual knowledge base from panoramic views, providing environment-specific semantic cues. These multimodal knowledge bases are integrated via the Goal-Aware Augmentor and the Knowledge Augmentor, significantly enhancing semantic grounding and cross-modal alignment. Extensive experiments on the R2R dataset (7,189 trajectories) and the REVERIE dataset (21,702 instructions) demonstrate that BTK significantly outperforms existing baselines: on the test unseen splits of R2R and REVERIE, SR improves by 5% and 2.07%, and SPL by 4% and 3.69%, respectively. The source code is available at https://github.com/yds3/IPM-BTK/.
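The abstract does not detail the internals of the two augmentors; below is a minimal cross-attention sketch of goal-conditioned knowledge fusion. All module names, dimensions, and the fusion order are assumptions, not the paper's specification.

```python
# Hypothetical fusion sketch: panoramic view features attend to knowledge
# entries, then the result is conditioned on the goal-phrase embedding.
import torch
import torch.nn as nn


class KnowledgeAugmentor(nn.Module):
    """Fuse panoramic view features with retrieved multimodal knowledge features."""
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, view_feats, knowledge_feats):
        # Views query the knowledge entries (image + text embeddings).
        fused, _ = self.cross_attn(view_feats, knowledge_feats, knowledge_feats)
        return self.norm(view_feats + fused)  # residual keeps the raw observation


class GoalAwareAugmentor(nn.Module):
    """Condition the fused features on the extracted goal-phrase embedding."""
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, fused_feats, goal_embed):
        goal_ctx, _ = self.cross_attn(fused_feats, goal_embed, goal_embed)
        return self.norm(fused_feats + goal_ctx)


# Shapes: batch of 2 episodes, 36 panoramic views, 12 knowledge entries, 1 goal token.
views = torch.randn(2, 36, 768)
knowledge = torch.randn(2, 12, 768)
goal = torch.randn(2, 1, 768)
feats = GoalAwareAugmentor()(KnowledgeAugmentor()(views, knowledge), goal)
print(feats.shape)  # torch.Size([2, 36, 768])
```

The residual connections are a common design choice in such fusion modules: they let the agent fall back on the raw visual observation when the retrieved knowledge is uninformative.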
Problem

Research questions and friction points this paper is trying to address.

Vision-and-Language Navigation
semantic grounding
cross-modal alignment
multimodal knowledge
visual observations
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal knowledge bases
vision-and-language navigation
semantic grounding
cross-modal alignment
generative image knowledge