🤖 AI Summary
Traditional vision-and-language navigation (VLN) agents struggle to adapt to the novel environments and instructions that continually emerge during deployment. Method: We propose Continual Vision-and-Language Navigation (CVLN), a new paradigm in which agents are trained and evaluated incrementally across multiple scene domains, covering both initial-instruction and interactive dialogue navigation modes. To mitigate catastrophic forgetting, we introduce two fine-grained, rehearsal-based replay mechanisms: Perplexity Replay (PerpR), which quantifies episode difficulty via language-model perplexity, and Episodic Self-Replay (ESR), which stores and revisits per-step action logits during training. Contribution/Results: Evaluated on the CVLN benchmark, PerpR and ESR consistently outperform existing continual learning methods, demonstrating the efficacy of fine-grained replay in sequential decision-making tasks and improving knowledge retention and transfer across scene domains.
📝 Abstract
In developing Vision-and-Language Navigation (VLN) agents that navigate to a destination using natural language instructions and visual cues, current studies largely assume a *train-once-deploy-once* strategy. We argue that this strategy is unrealistic, as deployed VLN agents are expected to encounter novel environments continually throughout their lifetimes. To facilitate a more realistic setting for VLN agents, we propose the Continual Vision-and-Language Navigation (CVLN) paradigm, in which agents continually learn and adapt to changing environments. In CVLN, agents are trained and evaluated incrementally across multiple *scene domains* (i.e., environments). We present two CVLN learning setups to cover diverse forms of natural language instruction: initial-instruction-based CVLN, focused on navigation via interpretation of an initial instruction, and dialogue-based CVLN, designed for navigation through dialogue with other agents. We introduce two simple yet effective baseline methods tailored to the sequential decision-making needs of CVLN: Perplexity Replay (PerpR) and Episodic Self-Replay (ESR), both employing a rehearsal mechanism. PerpR selects replay episodes based on episode difficulty, while ESR stores action logits from individual episode steps and revisits them during training to refine learning. Experimental results indicate that existing continual learning methods are insufficient for CVLN, while PerpR and ESR outperform them by effectively utilizing replay memory.
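The two replay mechanisms described above can be sketched in a few lines. This is a minimal, hypothetical illustration, not the paper's implementation: the episode dictionary format, the use of mean squared error between stored and current logits, and all function names are assumptions made for clarity. PerpR ranks episodes by the perplexity of their instructions (a standard proxy for difficulty) and keeps the hardest ones in replay memory; ESR penalizes drift between the action logits recorded when an episode was first learned and the agent's current logits at each step.

```python
import math

def episode_perplexity(token_log_probs):
    """Perplexity of an episode's instruction under a language model:
    exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def select_replay_episodes(episodes, memory_size):
    """PerpR-style selection (hypothetical sketch): keep the most
    difficult episodes, i.e. those with the highest perplexity."""
    ranked = sorted(episodes, key=lambda e: e["perplexity"], reverse=True)
    return ranked[:memory_size]

def esr_replay_loss(stored_logits, current_logits):
    """ESR-style step-level replay (hypothetical sketch): penalize
    divergence between the per-step action logits stored at learning
    time and the agent's current logits. Mean squared error is one
    assumed choice among possible distillation-style losses."""
    loss, count = 0.0, 0
    for old_step, new_step in zip(stored_logits, current_logits):
        for old, new in zip(old_step, new_step):
            loss += (old - new) ** 2
            count += 1
    return loss / count
```

In a continual-learning loop, `select_replay_episodes` would be called after training on each scene domain to refresh the bounded replay memory, and `esr_replay_loss` would be added to the navigation loss when replayed episodes are revisited in later domains.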