DOPE: Dual Object Perception-Enhancement Network for Vision-and-Language Navigation

📅 2025-04-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address two key bottlenecks in Vision-and-Language Navigation (VLN), insufficient fine-grained understanding of language instructions and the lack of cross-modal object relation modeling, this paper proposes the Dual Object Perception-Enhancement Network (DOPE), a Transformer-based framework with three modules: Text Semantic Extraction (TSE), Text Object Perception-Augmentation (TOPA), and Image Object Perception-Augmentation (IOPA). TSE extracts the essential phrases from an instruction; TOPA strengthens the critical object and action associations within those phrases; IOPA jointly models cross-modal object relations in the visual representations to uncover implicit structural cues. Evaluated on the R2R and REVERIE benchmarks, the approach achieves significant improvements in navigation success and path quality (+3.2% SPL), demonstrating enhanced comprehension of complex instructions and more robust decision-making. The framework establishes an interpretable and generalizable paradigm for cross-modal relational modeling in VLN.

📝 Abstract
Vision-and-Language Navigation (VLN) is a challenging task in which an agent must understand language instructions and navigate unfamiliar environments using visual cues. The agent must accurately locate the target based on visual information from the environment and complete tasks through interaction with its surroundings. Despite significant advancements in this field, two major limitations persist: (1) Many existing methods feed complete language instructions directly into multi-layer Transformer networks without fully exploiting the detailed information within the instructions, limiting the agent's language understanding during task execution; (2) Current approaches often overlook the modeling of object relationships across different modalities, failing to effectively utilize latent clues between objects, which affects the accuracy and robustness of navigation decisions. We propose a Dual Object Perception-Enhancement Network (DOPE) to address these issues and improve navigation performance. First, we design a Text Semantic Extraction (TSE) module to extract the essential phrases from the text and feed them into the Text Object Perception-Augmentation (TOPA) module, which fully leverages details such as objects and actions within the instructions. Second, we introduce an Image Object Perception-Augmentation (IOPA) module, which performs additional modeling of object information across modalities, enabling the model to more effectively exploit latent clues between objects in images and text and enhancing decision-making accuracy. Extensive experiments on the R2R and REVERIE datasets validate the efficacy of the proposed approach.
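The paper does not publish its implementation details here, but the core idea behind IOPA-style cross-modal object relation modeling can be sketched as scaled dot-product attention from instruction object phrases to detected visual objects. The function name, shapes, and feature dimensions below are illustrative assumptions, not the authors' code:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_object_attention(text_obj, img_obj):
    """Attend each text object phrase over image object features.

    text_obj: (T, d) embeddings of object/action phrases from the instruction
    img_obj:  (V, d) embeddings of objects detected in the current view
    Returns:  (T, d) text object features enriched with visual object context
    """
    d = text_obj.shape[-1]
    scores = text_obj @ img_obj.T / np.sqrt(d)  # (T, V) similarity matrix
    attn = softmax(scores, axis=-1)             # each phrase distributes over objects
    return attn @ img_obj                       # attention-weighted visual features

rng = np.random.default_rng(0)
text_feats = rng.normal(size=(4, 16))  # e.g. 4 object phrases from TSE
img_feats = rng.normal(size=(9, 16))   # e.g. 9 detected objects in a panorama
fused = cross_modal_object_attention(text_feats, img_feats)
print(fused.shape)  # (4, 16)
```

In the actual model this would be one head of a multi-layer Transformer with learned query/key/value projections; the sketch keeps only the relation-modeling step to show how latent clues between textual and visual objects are aligned.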
Problem

Research questions and friction points this paper is trying to address.

Existing methods feed whole instructions into multi-layer Transformers without exploiting their fine-grained details, limiting language understanding.
Object relationships across visual and textual modalities are rarely modeled, leaving latent clues between objects unused.
Both gaps reduce the accuracy and robustness of navigation decisions.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text Semantic Extraction (TSE) surfaces the essential phrases of an instruction for fine-grained understanding
Dual object perception-augmentation (TOPA and IOPA) models object relations within and across modalities
Together, the three modules improve navigation decision accuracy on R2R and REVERIE