SLIP-RS: Structured-Attribute Language-Image Pre-Training for Remote Sensing Object Detection

📅 2026-05-21
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the limitations of existing remote sensing object detection methods, which rely on holistic label learning and struggle to achieve fine-grained representations under data scarcity. To overcome this, we propose a structured attribute disentanglement paradigm that maps open-ended categories into a compact, physically meaningful attribute space, enabling fine-grained discrimination through structured logical reasoning. We introduce RS-Attribute-15M, the largest remote sensing attribute dataset to date, comprising over 15 million attribute annotations, and enhance supervision reliability via conformal prediction theory. Additionally, we design a structured attribute contrastive learning framework, a compositional attribute augmentation strategy, and a language–image pretraining mechanism. The proposed approach significantly outperforms current state-of-the-art methods in both fine-grained detection and cross-domain generalization.
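To make the idea of mapping categories into an attribute space more concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the attribute slots, the category-to-attribute table, the prompt template, and the `perturb` helper are hypothetical and are not taken from SLIP-RS or RS-Attribute-15M. It only shows how a category could be described by a small set of attribute values, and how swapping a single value yields a nearby negative description of the kind a compositional augmentation strategy might use.

```python
import random

# Hypothetical attribute schema (assumption; the actual SLIP-RS attribute space is not given here).
ATTRIBUTE_SPACE = {
    "shape": ["elongated", "rectangular", "circular"],
    "color": ["white", "gray", "dark"],
    "context": ["on water", "on a runway", "beside a road"],
}

# Hypothetical mapping from open-ended category names to attribute values.
CATEGORY_ATTRIBUTES = {
    "cargo ship": {"shape": "elongated", "color": "dark", "context": "on water"},
    "airplane": {"shape": "elongated", "color": "white", "context": "on a runway"},
    "storage tank": {"shape": "circular", "color": "white", "context": "beside a road"},
}


def describe(attrs):
    """Render an attribute assignment as a text prompt (template is an assumption)."""
    return f"a {attrs['color']} {attrs['shape']} object {attrs['context']}"


def perturb(attrs, rng, n_changes=1):
    """Return a copy of attrs with n_changes slots swapped to a different value."""
    out = dict(attrs)
    for slot in rng.sample(sorted(out), n_changes):
        choices = [v for v in ATTRIBUTE_SPACE[slot] if v != out[slot]]
        out[slot] = rng.choice(choices)
    return out


rng = random.Random(0)
positive = CATEGORY_ATTRIBUTES["cargo ship"]
negative = perturb(positive, rng)  # differs from the positive in exactly one slot

print(describe(positive))
print(describe(negative))
```

In a contrastive setup, the positive description would be paired with the image region and the perturbed one treated as a hard negative, which pushes the model to attend to individual attributes rather than the category name as a whole.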
📝 Abstract
Existing language-image pre-training for remote sensing object detection is constrained by Monolithic Label Learning, which relies on exhaustively enumerating open-set categories via black-box data to acquire fine-grained representations, creating a dependency incompatible with the domain's inherent data scarcity. To transcend this bottleneck, we propose SLIP-RS, establishing a Structured-Attribute Decoupling Paradigm that maps the open-ended category space into a finite, physically meaningful attribute space, unlocking fine-grained discriminability via explicit structural logic. This paradigm is realized via two technical pillars: (1) Structured-Attribute Contrastive Learning, which enforces the learning of decoupled intrinsic visual logic via combinatorial attribute augmentation; and (2) Conformal Attribute Reliability Engine, which leverages conformal prediction theory to rigorously distill high-fidelity supervision from noisy sources, yielding RS-Attribute-15M, the largest dataset with over 15 million attribute annotations. Extensive experiments demonstrate that SLIP-RS establishes unprecedented performance in fine-grained detection and cross-domain generalization, validating structured attributes as a vital foundation for remote sensing. Code: https://github.com/facias914/SLIP-RS.
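The abstract does not spell out how the Conformal Attribute Reliability Engine works, so the following is only a generic sketch of split conformal prediction applied to label filtering, not the authors' implementation. It assumes a hypothetical attribute classifier that outputs per-value probabilities, a small calibration set with trusted labels, and a keep rule (retain a noisy annotation only when the prediction set is exactly that label) chosen here for illustration.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Split-conformal quantile of calibration nonconformity scores.

    cal_scores: 1 - p(true label) for each trusted calibration sample.
    alpha: target miscoverage rate (e.g. 0.25 for 75% coverage).
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank with finite-sample correction
    if k > n:
        return float("inf")  # too few calibration samples for this alpha
    return sorted(cal_scores)[k - 1]


def prediction_set(probs, qhat):
    """All labels whose nonconformity score 1 - p does not exceed the threshold."""
    return {label for label, p in probs.items() if 1.0 - p <= qhat}


def keep_annotation(noisy_label, probs, qhat):
    """Illustrative rule (assumption): keep only unambiguous, agreeing labels."""
    return prediction_set(probs, qhat) == {noisy_label}


# Made-up calibration scores, for illustration only.
cal_scores = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30, 0.35, 0.40, 0.55]
qhat = conformal_threshold(cal_scores, alpha=0.25)

confident = {"white": 0.85, "gray": 0.10, "dark": 0.05}
uncertain = {"white": 0.50, "gray": 0.45, "dark": 0.05}

print(qhat)
print(keep_annotation("white", confident, qhat))
print(keep_annotation("white", uncertain, qhat))
```

Under the usual exchangeability assumption between calibration and test samples, prediction sets built this way contain the true label with probability at least 1 - alpha, which is what makes the threshold a principled filter rather than a hand-tuned confidence cutoff.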
Problem

Research questions and friction points this paper is trying to address.

language-image pre-training
remote sensing object detection
monolithic label learning
data scarcity
fine-grained representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Structured-Attribute Decoupling
Contrastive Learning
Conformal Prediction
Remote Sensing Object Detection
Attribute-Based Representation