FarSLIP: Discovering Effective CLIP Adaptation for Fine-Grained Remote Sensing Understanding

📅 2025-11-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
CLIP’s global alignment mechanism struggles with fine-grained understanding of remote sensing (RS) imagery, and existing RS-CLIP variants suffer from coarse-grained image-text supervision and ineffective region-level alignment. To address these limitations, we propose FarSLIP: (1) We introduce MGRS-200k—the first large-scale, multi-granularity RS image-text dataset—explicitly modeling both object-level and scene-level semantics. (2) We design a patch-to-patch distillation strategy that preserves CLIP’s semantic consistency while enabling local visual alignment. (3) We propose a CLS token-guided region–category alignment mechanism, replacing error-prone explicit region detection to enhance spatial awareness. Extensive experiments demonstrate that FarSLIP consistently outperforms state-of-the-art methods on RS open-vocabulary semantic segmentation, zero-shot classification, and cross-modal retrieval—achieving substantial gains in fine-grained cross-modal understanding.
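The CLS token-guided region-category alignment can be illustrated with a minimal NumPy sketch. This is an assumption-based illustration of the idea described above (pool patch features by their affinity to the CLS token, then score the pooled feature against category text embeddings), not the paper's implementation; the function name, tensor shapes, and the temperature `tau` are all hypothetical.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cls_guided_region_alignment(cls_token, patch_tokens, category_embs, tau=0.07):
    """Hypothetical sketch of CLS token-guided region-category alignment.

    cls_token:     (D,)   global CLS embedding
    patch_tokens:  (P, D) patch embeddings
    category_embs: (C, D) category text embeddings
    Returns:       (C,)   category probabilities for the CLS-weighted region
    """
    # Weight each patch by its similarity to the CLS token -- no explicit
    # region detection is needed.
    weights = softmax(patch_tokens @ cls_token / tau)        # (P,)
    region = weights @ patch_tokens                          # (D,) pooled region feature
    region = region / np.linalg.norm(region)
    cats = category_embs / np.linalg.norm(category_embs, axis=-1, keepdims=True)
    # Score the pooled region feature against every category embedding.
    return softmax(region @ cats.T / tau)                    # (C,)
```

The CLS weighting replaces an external region detector: patches that the global token already attends to act as a soft region mask, which is what makes the mechanism robust to detection errors.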

📝 Abstract
As CLIP's global alignment limits its ability to capture fine-grained details, recent efforts have focused on enhancing its region-text alignment. However, current remote sensing (RS)-specific CLIP variants still inherit this limited spatial awareness. We identify two key limitations behind this: (1) current RS image-text datasets generate global captions from object-level labels, leaving the original object-level supervision underutilized; (2) despite the success of region-text alignment methods in the general domain, their direct application to RS data often leads to performance degradation. To address these, we construct the first multi-granularity RS image-text dataset, MGRS-200k, featuring rich object-level textual supervision for RS region-category alignment. We further investigate existing fine-grained CLIP tuning strategies and find that current explicit region-text alignment methods, whether direct or indirect, underperform due to severe degradation of CLIP's semantic coherence. Building on these findings, we propose FarSLIP, a Fine-grained Aligned RS Language-Image Pretraining framework. Rather than the commonly used patch-to-CLS self-distillation, FarSLIP employs patch-to-patch distillation to align local and global visual cues, which improves feature discriminability while preserving semantic coherence. Additionally, to effectively utilize region-text supervision, it employs simple CLS token-based region-category alignment rather than explicit patch-level alignment, further enhancing spatial awareness. FarSLIP features improved fine-grained vision-language alignment in the RS domain and sets a new state of the art not only on RS open-vocabulary semantic segmentation, but also on image-level tasks such as zero-shot classification and image-text retrieval. Our dataset, code, and models are available at https://github.com/NJU-LHRS/FarSLIP.
Problem

Research questions and friction points this paper is trying to address.

CLIP's global alignment limits fine-grained detail capture in remote sensing
Current remote sensing CLIP variants inherit this limited spatial awareness
Direct region-text alignment methods degrade performance on remote sensing data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-granularity dataset with object-level textual supervision
Patch-to-patch distillation preserving semantic coherence
CLS token-based region-category alignment enhancing spatial awareness
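The patch-to-patch distillation idea above can be sketched briefly: instead of pulling every student patch toward the teacher's single CLS token, each student patch is aligned with the frozen teacher's patch at the same spatial location. This is a minimal NumPy sketch under assumed shapes; the function name and the cosine-distance loss form are illustrative assumptions, not the paper's exact objective.

```python
import numpy as np

def patch_to_patch_distill_loss(student_patches, teacher_patches):
    """Hypothetical sketch of patch-to-patch distillation.

    student_patches, teacher_patches: (P, D) patch embeddings, where the
    teacher is a frozen CLIP encoder. Each student patch is matched to the
    teacher patch at the SAME position (patch-to-patch), rather than all
    patches being pulled toward one global CLS token (patch-to-CLS).
    Returns the mean cosine distance across patches.
    """
    s = student_patches / np.linalg.norm(student_patches, axis=-1, keepdims=True)
    t = teacher_patches / np.linalg.norm(teacher_patches, axis=-1, keepdims=True)
    # 1 - cosine similarity per patch, averaged over all P patches.
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))
```

Because the target for each patch is a local feature rather than a single global vector, the student keeps per-location semantic structure, which is the "semantic coherence" the summary says patch-to-CLS distillation destroys.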