🤖 AI Summary
This work addresses the challenge of catastrophic forgetting in remote sensing vision-language models when continually learning new modalities and tasks, a problem exacerbated by the absence of dedicated continual learning benchmarks. To bridge this gap, the authors introduce CLeaRS, the first benchmark specifically designed for continual learning in remote sensing, comprising 10 subsets and over 207,000 image-text pairs that support diverse multimodal and multitask scenarios. They further propose three evaluation protocols (long-horizon, modality-incremental, and task-incremental) tailored to continual learning settings. Through systematic evaluation of state-of-the-art models and continual learning methods, the study reveals substantial forgetting across task, instruction, and modality transitions, demonstrating that existing approaches exhibit limited efficacy in remote sensing contexts. This benchmark and analysis provide foundational resources and standardized evaluation criteria to advance continual learning research in remote sensing.
📝 Abstract
Current remote sensing vision-language models (RS VLMs) demonstrate impressive performance in image interpretation but rely on static training data, limiting their ability to accommodate continuously emerging sensing modalities and downstream tasks. This exposes a fundamental challenge: enabling RS VLMs to continually adapt without catastrophic forgetting. Despite its practical importance, the continual learning capability of RS VLMs remains underexplored, and no dedicated benchmark currently exists. In this work, we present CLeaRS, a comprehensive benchmark for continual vision-language learning in remote sensing. CLeaRS comprises 10 curated subsets with over 207k image-text pairs, spanning diverse interpretation tasks, sensing modalities, and application scenarios. We further define three evaluation protocols, namely long-horizon, modality-incremental, and task-incremental settings, to systematically assess continual adaptation. Extensive benchmarking of diverse vision-language models reveals catastrophic forgetting across all settings. Moreover, representative continual learning methods, when adapted to RS VLMs, exhibit limited effectiveness in handling task, instruction, and modality transitions. Our findings underscore the need for developing continual learning methods tailored to RS VLMs.