Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation

📅 2025-03-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Vision-Language Navigation (VLN) agents generalize poorly to unseen environments because training data is scarce. To address this, we propose Rewriting-driven AugMentation (RAM), a simulator-free, labor-saving data augmentation paradigm that rewrites existing human-annotated observation-instruction pairs to generate novel training samples. RAM introduces two mechanisms: Object-Enriched Observation Rewriting and Observation-Contrast Instruction Rewriting. We further design a mixing-then-focusing training strategy with random observation cropping, which increases the diversity of the data distribution while suppressing noise in the augmented data. The method combines vision-language models (VLMs), large language models (LLMs), and text-to-image generation models (T2IMs) to synthesize new observations and to generate instructions aligned with them. RAM shows superior performance and strong generalization in both discrete environments (R2R, REVERIE, and R4R) and continuous environments (R2R-CE), without relying on additional simulator data or labor-intensive cleaning of web-collected data. Our code is publicly available.

📝 Abstract

Data scarcity is a long-standing challenge in the Vision-Language Navigation (VLN) field, and it severely hinders the generalization of agents to unseen environments. Previous works primarily rely on additional simulator data or web-collected images/videos to improve generalization. However, simulator environments still offer limited diversity, and web-collected data often requires extensive labor to remove noise. In this paper, we propose a Rewriting-driven AugMentation (RAM) paradigm for VLN, which directly creates unseen observation-instruction pairs by rewriting human-annotated training data. Thanks to this rewriting mechanism, new observation-instruction pairs can be obtained in a simulator-free and labor-saving manner to promote generalization. Specifically, we first introduce Object-Enriched Observation Rewriting, where we combine Vision-Language Models (VLMs) and Large Language Models (LLMs) to derive rewritten object-enriched scene descriptions, enabling observation synthesis with diverse objects and spatial layouts via Text-to-Image Generation Models (T2IMs). Then, we propose Observation-Contrast Instruction Rewriting, which generates observation-aligned rewritten instructions by requiring LLMs to reason about the differences between the original and new observations. We further develop a mixing-then-focusing training strategy with a random observation cropping scheme, which enhances the diversity of the data distribution while suppressing noise in the augmented data during training. Experiments on both discrete environments (R2R, REVERIE, and R4R datasets) and continuous environments (R2R-CE dataset) show the superior performance and impressive generalization ability of our method. Code is available at https://github.com/SaDil13/VLN-RAM.
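
The abstract names a mixing-then-focusing training strategy and a random observation cropping scheme but does not spell out either one. The Python sketch below is one plausible reading, not the authors' implementation: it assumes that "mixing" means training on original plus augmented samples for the first few epochs and that "focusing" means training on original samples only afterwards. The names `random_crop`, `build_epoch_pool`, and `mixing_epochs` are hypothetical, and images are modeled as plain 2-D lists to keep the example dependency-free.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")
Image = List[List[int]]  # toy stand-in for an observation: a 2-D grid of pixel values


def random_crop(image: Image, crop_h: int, crop_w: int, rng: random.Random) -> Image:
    """Return a randomly positioned crop_h x crop_w window of `image`."""
    height, width = len(image), len(image[0])
    if crop_h > height or crop_w > width:
        raise ValueError("crop size exceeds image size")
    top = rng.randint(0, height - crop_h)    # randint is inclusive on both ends
    left = rng.randint(0, width - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def build_epoch_pool(
    original: Sequence[T],
    augmented: Sequence[T],
    epoch: int,
    mixing_epochs: int,
) -> List[T]:
    """Assumed schedule (not confirmed by the source):
    epochs [0, mixing_epochs) use original + augmented samples,
    later epochs use original samples only."""
    if epoch < mixing_epochs:
        return list(original) + list(augmented)
    return list(original)
```

Under these assumptions, a training loop would call `build_epoch_pool` once per epoch to pick its sample pool and apply `random_crop` to augmented observations before they are fed to the agent.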

Problem

Research questions and friction points this paper is trying to address.

Addresses data scarcity in Vision-Language Navigation (VLN) for unseen environments

Proposes rewriting human-annotated data to create diverse observation-instruction pairs

Enhances generalization through simulator-free, labor-saving rewriting of observations and instructions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Rewriting human-annotated data to create unseen observation-instruction pairs

Combining VLMs and LLMs for object-enriched descriptions

Observation-contrast instruction rewriting via LLMs
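
The three points above describe a data flow that can be sketched in Python. This is only an illustration under stated assumptions: the VLM, LLM, and T2IM are passed in as plain callables because the page names no specific models or APIs, the prompt wording is invented for the example, and `rewrite_pair` and `RewrittenPair` are hypothetical names rather than anything from the RAM codebase.

```python
from dataclasses import dataclass
from typing import Callable

Captioner = Callable[[str], str]       # VLM stand-in: image reference -> scene description
TextModel = Callable[[str], str]       # LLM stand-in: prompt -> completion
ImageGenerator = Callable[[str], str]  # T2IM stand-in: description -> new image reference


@dataclass
class RewrittenPair:
    observation: str   # reference to the synthesized observation
    description: str   # rewritten, object-enriched scene description
    instruction: str   # instruction rewritten to match the new observation


def rewrite_pair(
    observation: str,
    instruction: str,
    vlm: Captioner,
    llm: TextModel,
    t2im: ImageGenerator,
) -> RewrittenPair:
    # Step 1: describe the original observation with the VLM.
    original_desc = vlm(observation)
    # Step 2: ask the LLM for a rewritten description (prompt text is illustrative).
    new_desc = llm(
        "Rewrite this scene description with different objects and layout:\n"
        + original_desc
    )
    # Step 3: synthesize a new observation from the rewritten description.
    new_observation = t2im(new_desc)
    # Step 4: ask the LLM to contrast both scenes and rewrite the instruction.
    contrast_prompt = (
        "Original scene: " + original_desc + "\n"
        "New scene: " + new_desc + "\n"
        "Original instruction: " + instruction + "\n"
        "Rewrite the instruction so that it matches the new scene."
    )
    return RewrittenPair(new_observation, new_desc, llm(contrast_prompt))
```

Passing the models as callables keeps the sketch independent of any particular model library; real model clients would be wrapped to fit these three signatures.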

🔎 Similar Papers

Vision-and-Language Navigation Today and Tomorrow: A Survey in the Era of Foundation Models