RingMo-Aerial: An Aerial Remote Sensing Foundation Model With A Affine Transformation Contrastive Learning

📅 2024-09-20
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Aerial remote sensing vision tasks face fundamental challenges, including difficulty in detecting small objects, weak modeling of oblique-view geometry, and poor model generalization. To address these, we propose RingMo-Aerial, the first general-purpose foundation model tailored for aerial remote sensing. It introduces two key innovations: (1) a frequency-enhanced multi-head self-attention (FE-MSA) mechanism that improves robustness to scale variation and geometric distortion; and (2) an affine-transform-based contrastive pretraining paradigm that enhances geometric invariance. Furthermore, we design a lightweight ARS-Adapter for parameter-efficient fine-tuning, enabling adaptive transfer across diverse downstream tasks. Evaluated on multiple benchmarks, including small-object detection and semantic segmentation, RingMo-Aerial achieves state-of-the-art performance, delivering a +5.2% AP gain in small-object detection and significantly improved cross-task generalization.

📝 Abstract
Aerial Remote Sensing (ARS) vision tasks pose significant challenges due to the unique characteristics of their viewing angles. Existing research has primarily focused on algorithms for specific tasks, which have limited applicability in a broad range of ARS vision applications. This paper proposes the RingMo-Aerial model, aiming to fill the gap in foundation model research in the field of ARS vision. By introducing the Frequency-Enhanced Multi-Head Self-Attention (FE-MSA) mechanism and an affine transformation-based contrastive learning pre-training method, the model's detection capability for small targets is enhanced and optimized for the tilted viewing angles characteristic of ARS. Furthermore, the ARS-Adapter, an efficient parameter fine-tuning method, is proposed to improve the model's adaptability and effectiveness in various ARS vision tasks. Experimental results demonstrate that RingMo-Aerial achieves SOTA performance on multiple downstream tasks. This indicates the practicality and efficacy of RingMo-Aerial in enhancing the performance of ARS vision tasks.
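The abstract does not spell out how the affine transformation-based contrastive pre-training is set up, so the following is only a minimal sketch of the general idea under stated assumptions: two views of the same scene are related by a randomly sampled affine map (rotation, scale, shear, translation), and an InfoNCE-style loss pulls the embeddings of matching views together. The function names, parameter ranges, and the choice of InfoNCE are illustrative assumptions, not the paper's actual method.

```python
import math
import random


def random_affine(rng, max_rot_deg=30.0, scale_range=(0.8, 1.2),
                  max_shear=0.2, max_shift=0.1):
    """Sample a 2x3 affine matrix (hypothetical parameter ranges)."""
    theta = math.radians(rng.uniform(-max_rot_deg, max_rot_deg))
    s = rng.uniform(*scale_range)
    sh = rng.uniform(-max_shear, max_shear)
    tx = rng.uniform(-max_shift, max_shift)
    ty = rng.uniform(-max_shift, max_shift)
    c, n = math.cos(theta), math.sin(theta)
    # s * Rotation(theta) @ Shear(sh), with the translation as the last column.
    return [[s * c, s * (c * sh - n), tx],
            [s * n, s * (n * sh + c), ty]]


def apply_affine(matrix, point):
    """Map one (x, y) coordinate through the 2x3 affine matrix."""
    x, y = point
    (a, b, tx), (c, d, ty) = matrix
    return (a * x + b * y + tx, c * x + d * y + ty)


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(p * q for p, q in zip(u, v))
    norm_u = math.sqrt(sum(p * p for p in u))
    norm_v = math.sqrt(sum(q * q for q in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: positives[i] matches anchors[i]; the other
    entries of `positives` serve as in-batch negatives."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        top = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denom - logits[i]
    return total / len(anchors)


if __name__ == "__main__":
    rng = random.Random(0)
    matrix = random_affine(rng)
    print("warped corner:", apply_affine(matrix, (1.0, 1.0)))
    # Toy embeddings standing in for encoder outputs of the two views.
    view_a = [[1.0, 0.0], [0.0, 1.0]]
    view_b = [[0.9, 0.1], [0.1, 0.9]]
    print("loss:", info_nce(view_a, view_b))
```

In a real pipeline the sampled matrix would warp the image pixels (for example through a library resampling routine) before both views pass through the encoder; here it is applied to coordinates only to keep the sketch dependency-free.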
Problem

Research questions and friction points this paper is trying to address.

Addresses limited applicability of existing ARS vision task algorithms
Enhances small target detection in tilted ARS viewing angles
Improves adaptability for diverse ARS vision tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Frequency-Enhanced Multi-Head Self-Attention mechanism
Affine transformation-based contrastive learning
ARS-Adapter for parameter-efficient fine-tuning
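The page gives no architectural details for the ARS-Adapter. As a rough illustration of what parameter-efficient fine-tuning with an adapter usually means, the sketch below implements a generic bottleneck adapter (down-projection, ReLU, up-projection, residual connection) in plain Python. The class name, dimensions, and zero initialization of the up-projection are assumptions chosen for illustration and should not be read as the paper's design.

```python
import random


def adapter_param_count(d_model, bottleneck):
    """Trainable parameters: down-projection and up-projection, each with bias."""
    return d_model * bottleneck + bottleneck + bottleneck * d_model + d_model


class BottleneckAdapter:
    """Generic residual bottleneck adapter (hypothetical, not the ARS-Adapter)."""

    def __init__(self, d_model, bottleneck, rng):
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(bottleneck)]
                       for _ in range(d_model)]
        self.b_down = [0.0] * bottleneck
        # Zero-initialized up-projection: the adapter starts as an identity map,
        # so inserting it does not disturb the frozen backbone at step zero.
        self.w_up = [[0.0] * d_model for _ in range(bottleneck)]
        self.b_up = [0.0] * d_model

    def __call__(self, x):
        # Down-project to the bottleneck width and apply ReLU.
        hidden = [
            max(0.0, sum(x[i] * self.w_down[i][j] for i in range(len(x)))
                + self.b_down[j])
            for j in range(len(self.b_down))
        ]
        # Up-project back to the model width.
        delta = [
            sum(hidden[j] * self.w_up[j][k] for j in range(len(hidden)))
            + self.b_up[k]
            for k in range(len(self.b_up))
        ]
        # Residual connection around the bottleneck.
        return [xi + di for xi, di in zip(x, delta)]


if __name__ == "__main__":
    adapter = BottleneckAdapter(d_model=8, bottleneck=2, rng=random.Random(0))
    token = [0.5] * 8
    print(adapter(token) == token)
    print(adapter_param_count(768, 64))
```

With a hypothetical width of 768 and a bottleneck of 64, such an adapter has 99,136 trainable parameters, compared with 590,592 for a single full 768-by-768 linear layer with bias, which is why only the adapter weights are typically updated while the backbone stays frozen.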
Authors
W. Diao
Aerospace Information Research Institute, Chinese Academy of Sciences
Haichen Yu
Aerospace Information Research Institute, Chinese Academy of Sciences
Kaiyue Kang
Aerospace Information Research Institute, Chinese Academy of Sciences
Tong Ling
Aerospace Information Research Institute, Chinese Academy of Sciences
Di Liu
Yingchao Feng
Aerospace Information Research Institute, Chinese Academy of Sciences
Machine learning in vision; Statistical and structural pattern recognition; Image/video analysis and understanding; Remote sensing image understanding; Machine learning and data mining with applications to remote sensing
Hanbo Bi
Aerospace Information Research Institute, Chinese Academy of Sciences
Libo Ren
Aerospace Information Research Institute, Chinese Academy of Sciences
Xuexue Li
Aerospace Information Research Institute, Chinese Academy of Sciences
Yongqiang Mao
Department of Electronic Engineering, Tsinghua University
Xian Sun
Aerospace Information Research Institute, Chinese Academy of Sciences
Remote Sensing; Computer Vision and Pattern Recognition; Artificial Intelligence