🤖 AI Summary
Existing vision-language models (VLMs) have not been systematically studied for multi-temporal remote sensing understanding, and they do not unify modeling across single-image, bi-temporal image-pair, and video inputs. To address this, we propose the first unified VLM framework tailored for multi-temporal remote sensing tasks. Our method introduces a bi-temporal change extraction module to explicitly model land-surface dynamics, incorporates a reasoning-oriented prompt enhancement mechanism to improve temporal semantic alignment, and employs multi-task joint fine-tuning to enable cross-task knowledge sharing. Evaluated on three core tasks—remote sensing visual question answering, change description generation, and video scene classification—the framework achieves state-of-the-art performance, significantly improving temporal modeling accuracy and cross-task generalization. This work establishes a scalable, unified paradigm for multimodal temporal understanding in remote sensing.
📝 Abstract
The domain gap between remote sensing imagery and natural images has recently received widespread attention, and Vision-Language Models (VLMs) have demonstrated excellent generalization performance in remote sensing multimodal tasks. However, current research is still limited in exploring how remote sensing VLMs handle different types of visual inputs. To bridge this gap, we introduce **UniRS**, the first vision-language model **uni**fying multi-temporal **r**emote **s**ensing tasks across various types of visual input. UniRS supports single images, dual-time image pairs, and videos as input, enabling comprehensive remote sensing temporal analysis within a unified framework. We adopt a unified visual representation approach, enabling the model to accept various visual inputs. For dual-time image-pair tasks, we customize a change extraction module to further enhance the extraction of spatiotemporal features. Additionally, we design a prompt augmentation mechanism tailored to the model's reasoning process, utilizing the prior knowledge of the general-purpose VLM to provide clues for UniRS. To promote multi-task knowledge sharing, the model is jointly fine-tuned on a mixed dataset. Experimental results show that UniRS achieves state-of-the-art performance across diverse tasks, including visual question answering, change captioning, and video scene classification, highlighting its versatility and effectiveness in unifying these multi-temporal remote sensing tasks. Our code and dataset will be released soon.