Disentangling Foreground and Background for vision-Language Navigation via Online Augmentation

📅 2025-10-01

📈 Citations: 0

✨ Influential: 0

career value

178K/year

🤖 AI Summary

In Vision-and-Language Navigation (VLN), entanglement of foreground and background information hinders agent generalization to unseen environments. To address this, we propose a foreground-background disentangled online feature enhancement framework. First, semantic-enhanced landmark identification explicitly separates foreground from background in visual observations. Second, we introduce a consensus-driven dynamic feature selection mechanism that performs adaptive representation optimization during navigation via a two-stage voting process—integrating multi-instruction alignment and position-aware cues. Notably, this is the first work to incorporate dynamic consensus guidance into VLN feature selection, fully end-to-end integrated with vision-language pretrained models. Our method achieves state-of-the-art generalization performance on REVERIE and R2R benchmarks, significantly outperforming prior baselines. Empirical results validate that foreground-background disentanglement is critical for robust, generalizable navigation.

Technology Category

Application Category

📝 Abstract

Following language instructions, vision-language navigation (VLN) agents are tasked with navigating unseen environments. While augmenting multifaceted visual representations has propelled advancements in VLN, the significance of foreground and background in visual observations remains underexplored. Intuitively, foreground regions provide semantic cues, whereas the background encompasses spatial connectivity information. Inspired on this insight, we propose a Consensus-driven Online Feature Augmentation strategy (COFA) with alternative foreground and background features to facilitate the navigable generalization. Specifically, we first leverage semantically-enhanced landmark identification to disentangle foreground and background as candidate augmented features. Subsequently, a consensus-driven online augmentation strategy encourages the agent to consolidate two-stage voting results on feature preferences according to diverse instructions and navigational locations. Experiments on REVERIE and R2R demonstrate that our online foreground-background augmentation boosts the generalization of baseline and attains state-of-the-art performance.

Problem

Research questions and friction points this paper is trying to address.

Disentangling foreground and background features for vision-language navigation agents

Enhancing generalization through online augmentation of visual representations

Addressing underexplored significance of semantic and spatial visual cues

Innovation

Methods, ideas, or system contributions that make the work stand out.

Online augmentation strategy disentangles foreground and background features

Semantically-enhanced landmark identification separates candidate augmented features

Consensus-driven voting consolidates feature preferences across navigation stages

🔎 Similar Papers

Vision-and-Language Navigation Today and Tomorrow: A Survey in the Era of Foundation Models