FUSAR-GPT : A Spatiotemporal Feature-Embedded and Two-Stage Decoupled Visual Language Model for SAR Imagery

📅 2026-02-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language models exhibit limited performance on synthetic aperture radar (SAR) imagery, primarily due to the complex SAR imaging mechanism, sensitivity to scattering characteristics, and a scarcity of high-quality image–text paired data. To address these challenges, this work constructs the first SAR image–text–AlphaEarth feature triplet dataset and introduces FUSAR-GPT, a specialized vision-language model. FUSAR-GPT integrates geospatial baseline priors, multi-source remote sensing temporal features, and spatiotemporal anchor embeddings to enable dynamic feature compensation. Furthermore, it employs a two-stage supervised fine-tuning strategy that effectively decouples knowledge injection from task execution. Experimental results demonstrate that FUSAR-GPT significantly outperforms state-of-the-art baselines across multiple remote sensing vision-language benchmarks, achieving performance gains exceeding 12%.

📝 Abstract
Research on the intelligent interpretation of all-weather, all-time Synthetic Aperture Radar (SAR) imagery is crucial for advancing remote sensing applications. Although Visual Language Models (VLMs) have demonstrated strong open-world understanding on RGB images in recent years, their performance degrades severely when applied directly to SAR, owing to the complexity of the imaging mechanism, sensitivity to scattering features, and the scarcity of high-quality text corpora. To address this issue systematically, we construct the first SAR Image-Text-AlphaEarth feature triplet dataset and develop FUSAR-GPT, a VLM tailored to SAR. FUSAR-GPT introduces a geospatial foundation model as a 'world knowledge' prior and embeds multi-source remote-sensing temporal features into the model's visual backbone via 'spatiotemporal anchors', dynamically compensating for the sparse representation of targets in SAR images. Furthermore, we design a two-stage SFT strategy that decouples knowledge injection from task execution. Together, the spatiotemporal feature embedding and the two-stage decoupling paradigm enable FUSAR-GPT to achieve state-of-the-art performance on several representative remote sensing vision-language benchmarks, outperforming mainstream baseline models by over 12%.
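The paper does not publish implementation details for the 'spatiotemporal anchor' mechanism. As a rough illustration only, the sketch below shows one plausible form it could take: single-head cross-attention in which SAR visual tokens query a small bank of temporal anchor features (e.g. AlphaEarth embeddings), with a residual connection so the SAR stream is compensated rather than replaced. All names, dimensions, and the random stand-in weights are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def anchor_fuse(sar_tokens, anchor_feats, seed=0):
    """Fuse spatiotemporal anchor features into SAR visual tokens.

    Single-head cross-attention: queries come from the SAR tokens,
    keys/values from the anchor features. The residual add keeps the
    original SAR representation and only injects compensating context.
    Random matrices stand in for learned projection weights.
    """
    d = sar_tokens.shape[-1]
    rng = np.random.default_rng(seed)
    Wq = rng.standard_normal((d, d)) / np.sqrt(d)
    Wk = rng.standard_normal((d, d)) / np.sqrt(d)
    Wv = rng.standard_normal((d, d)) / np.sqrt(d)
    q = sar_tokens @ Wq                      # (n_sar, d)
    k = anchor_feats @ Wk                    # (n_anchor, d)
    v = anchor_feats @ Wv                    # (n_anchor, d)
    attn = softmax(q @ k.T / np.sqrt(d))     # (n_sar, n_anchor)
    return sar_tokens + attn @ v             # residual compensation

# Toy usage: 196 SAR patch tokens, 12 temporal anchors (e.g. monthly embeddings)
rng = np.random.default_rng(1)
sar = rng.standard_normal((196, 64))
anchors = rng.standard_normal((12, 64))
fused = anchor_fuse(sar, anchors)            # shape (196, 64)
```

In a real model the projections would be learned and the anchors produced by the geospatial prior; the point of the sketch is only the dataflow: sparse SAR tokens attending over dense multi-temporal features.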
Problem

Research questions and friction points this paper is trying to address.

Synthetic Aperture Radar
Visual Language Model
SAR imagery
intelligent interpretation
text corpora
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatiotemporal Feature Embedding
Two-Stage Decoupled SFT
Visual Language Model for SAR
Geospatial Prior
SAR Image-Text Dataset
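The two-stage decoupled SFT is described only at a high level. Under the common interpretation of such schemes, decoupling means each stage unfreezes a disjoint set of modules: stage 1 injects domain knowledge while the language model stays frozen, and stage 2 tunes task execution while the knowledge modules stay frozen. The schedule below is a hypothetical sketch of that idea; all module and dataset names are illustrative, not taken from the paper.

```python
# Hypothetical two-stage decoupled SFT schedule (illustrative names only).
STAGES = [
    {
        "name": "stage1_knowledge_injection",
        "data": "sar_image_text_alphaearth_triplets",        # domain corpus
        "trainable": {"anchor_embedder", "vision_projector"},  # LLM frozen
        "lr": 1e-4,
    },
    {
        "name": "stage2_task_execution",
        "data": "instruction_following_tasks",
        "trainable": {"llm_lora_adapters"},   # knowledge modules frozen
        "lr": 2e-5,
    },
]

def trainable_params(named_params, stage):
    """Return the parameter names unfrozen in this stage.

    Everything outside stage["trainable"] stays frozen, which is what
    keeps knowledge injection decoupled from task tuning.
    """
    return [name for name, group in named_params if group in stage["trainable"]]

# Toy parameter registry: (parameter name, module group)
params = [
    ("anchor_embedder.w", "anchor_embedder"),
    ("vision_projector.w", "vision_projector"),
    ("llm_lora_adapters.a", "llm_lora_adapters"),
]
stage1 = trainable_params(params, STAGES[0])  # knowledge modules only
stage2 = trainable_params(params, STAGES[1])  # task adapters only
```

A real training loop would feed `trainable_params` into the optimizer's parameter groups per stage; the sketch shows only the freezing logic that realizes the decoupling.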
👥 Authors
Xiaokun Zhang
City University of Hong Kong, Dalian University of Technology
Data mining · Recommendation · NLP
Yi Yang
Fudan University
Ziqi Ye
Fudan University
Baiyun
Fudan University
Xiaorong Guo
Fudan University
Qingchen Fang
Fudan University
Ruyi Zhang
Fudan University
Xinpeng Zhou
Fudan University
Haipeng Wang
Fudan University
synthetic aperture radar · image processing · signal processing