UniRS: Unifying Multi-temporal Remote Sensing Tasks through Vision Language Models

📅 2024-12-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language models (VLMs) have not been systematically investigated for multi-temporal remote sensing understanding, and current approaches fail to unify modeling across single-image, bi-temporal image-pair, and video inputs. To address this, the authors propose the first unified VLM framework tailored for multi-temporal remote sensing tasks. The method introduces a bi-temporal change extraction module to explicitly model land-surface dynamics, incorporates a reasoning-oriented prompt enhancement mechanism to improve temporal semantic alignment, and employs multi-task joint fine-tuning to enable cross-task knowledge sharing. Evaluated on three core tasks—remote sensing visual question answering, change description generation, and video scene classification—the framework achieves state-of-the-art performance, significantly improving temporal modeling accuracy and cross-task generalization. This work establishes a scalable, unified paradigm for multimodal temporal understanding in remote sensing.

📝 Abstract
The domain gap between remote sensing imagery and natural images has recently received widespread attention, and Vision-Language Models (VLMs) have demonstrated excellent generalization performance in remote sensing multimodal tasks. However, current research remains limited in exploring how remote sensing VLMs handle different types of visual inputs. To bridge this gap, we introduce **UniRS**, the first vision-language model **uni**fying multi-temporal **r**emote **s**ensing tasks across various types of visual input. UniRS supports single images, dual-time image pairs, and videos as input, enabling comprehensive remote sensing temporal analysis within a unified framework. We adopt a unified visual representation approach, enabling the model to accept various visual inputs. For dual-time image pair tasks, we customize a change extraction module to further enhance the extraction of spatiotemporal features. Additionally, we design a prompt augmentation mechanism tailored to the model's reasoning process, utilizing the prior knowledge of the general-purpose VLM to provide clues for UniRS. To promote multi-task knowledge sharing, the model is jointly fine-tuned on a mixed dataset. Experimental results show that UniRS achieves state-of-the-art performance across diverse tasks, including visual question answering, change captioning, and video scene classification, highlighting its versatility and effectiveness in unifying these multi-temporal remote sensing tasks. Our code and dataset will be released soon.
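The paper does not detail the change extraction module here, but the general idea behind bi-temporal change modeling can be illustrated with a minimal difference-and-concatenate sketch over visual features (all function and variable names below are hypothetical, not taken from the UniRS implementation):

```python
import numpy as np

def change_extraction(feat_t1: np.ndarray, feat_t2: np.ndarray) -> np.ndarray:
    """Hypothetical sketch of bi-temporal feature fusion.

    Given visual features extracted from two acquisition times
    (shape: [num_tokens, dim]), an explicit change signal is formed
    as the element-wise difference, then concatenated with both
    per-time features so a downstream language model sees the
    'before', 'after', and 'what changed' representations together.
    """
    diff = feat_t2 - feat_t1  # explicit change signal
    return np.concatenate([feat_t1, feat_t2, diff], axis=-1)

# Example: 4 visual tokens with 8-dim features per timestamp
fused = change_extraction(np.zeros((4, 8)), np.ones((4, 8)))
print(fused.shape)  # fused features: [4, 24]
```

In practice such a module would operate on learned embeddings and likely include attention or convolution rather than a raw subtraction; this sketch only conveys why pairing the difference with both inputs helps a VLM describe land-surface change.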
Problem

Research questions and friction points this paper is trying to address.

Visual Language Models
Remote Sensing Imagery
Temporal Earth Observation
Innovation

Methods, ideas, or system contributions that make the work stand out.

UniRS Model
Multi-Temporal Scale
Visual Language Processing
Yujie Li
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China
Wenjia Xu
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China
Guangzuo Li
Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China
Zijian Yu
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China
Zhiwei Wei
School of Geographic Sciences, Hunan Normal University
Jiuniu Wang
City University of Hong Kong, Hong Kong SAR
Mugen Peng
Beijing University of Posts & Telecommun., IEEE Fellow, Web of Science Highly Cited Researcher
Fog Computing · Cloud Radio Access Networks · Integrated Satellite-Terrestrial Networks · 6G