Language-guided Learning for Object Detection Tackling Multiple Variations in Aerial Images

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address degraded object detection performance in aerial imagery caused by variable illumination, viewing angles, and other scene factors, this paper proposes LANGO, a language-guided dual-granularity detection framework. LANGO introduces a novel scene-level and instance-level decoupled modeling mechanism: a visual-semantic reasoning module captures scene conditions, while a CLIP-inspired cross-modal representation distillation strategy—implemented via a relational learning loss—explicitly encodes linguistic semantic relationships among categories, thereby enhancing robustness to scale, pose, and large-angle viewpoint variations. By integrating multi-scale feature alignment with contrastive language–vision relational learning, LANGO achieves consistent improvements of 3.2–5.7% in mean Average Precision (mAP) on the DOTA and HRSC benchmarks, significantly mitigating missed and false detections under challenging illumination and extreme viewing angles.

📝 Abstract
Despite recent advancements in computer vision research, object detection in aerial images still suffers from several challenges. One primary challenge is the presence of multiple types of variation in aerial images, for example, illumination and viewpoint changes. These variations result in highly diverse image scenes and drastic alterations in object appearance, making it more complicated to localize objects within the whole image scene and recognize their categories. To address this problem, we introduce a novel object detection framework for aerial images, named LANGuage-guided Object detection (LANGO). Built upon the proposed language-guided learning, the framework is designed to alleviate the impacts of both scene-level and instance-level variations. First, we are motivated by the way humans understand the semantics of scenes while perceiving environmental factors within them (e.g., weather). We therefore design a visual semantic reasoner that comprehends the visual semantics of image scenes by interpreting the conditions under which the given images were captured. Second, we devise a training objective, named relation learning loss, to deal with instance-level variations such as viewpoint angle and scale changes. This objective aims to learn relations in the language representations of object categories, exploiting their robustness against such variations. Through extensive experiments, we demonstrate the effectiveness of the proposed method, which obtains noticeable detection performance improvements.
Problem

Research questions and friction points this paper is trying to address.

Addressing multiple variations in aerial images
Improving object detection under diverse conditions
Mitigating scene and instance-level appearance changes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Language-guided learning for aerial object detection
Visual semantic reasoner interprets image conditions
Relation learning loss handles instance variations
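The relation learning loss described above learns relations in the language representations of object categories. One plausible reading, sketched below under assumptions (the paper's exact formulation is not given here), is to encourage the pairwise similarities among per-instance visual features to match the pairwise similarities of frozen language embeddings of their category names. The function names and the MSE objective are illustrative, not the authors' implementation.

```python
import numpy as np

def cosine_sim_matrix(X):
    # Row-normalize the feature matrix, then take pairwise cosine similarities.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T

def relation_learning_loss(visual_feats, text_embeds):
    """Hypothetical sketch of a relation learning loss: push the pairwise
    similarity structure of visual instance features toward that of the
    (frozen) language embeddings of their object categories."""
    sim_visual = cosine_sim_matrix(visual_feats)
    sim_text = cosine_sim_matrix(text_embeds)  # treated as fixed targets
    return float(np.mean((sim_visual - sim_text) ** 2))

# Toy example: 3 detected instances with 4-dim visual features,
# paired with 4-dim language embeddings of their category names.
rng = np.random.default_rng(0)
visual = rng.normal(size=(3, 4))
text = rng.normal(size=(3, 4))
print(relation_learning_loss(visual, text))  # non-negative scalar
```

Because the language-side similarities serve as fixed targets, such an objective transfers the relational structure of category semantics onto the visual features, which is one way robustness to scale and viewpoint changes could be encouraged.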
Sungjune Park
Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST)
Deep learning · Machine learning · Object detection
Hyunjun Kim
Image and Video Systems Lab., School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), 291 Daehak-ro, Yuseong-gu, Daejeon, 34141, Republic of Korea
Beomchan Park
Image and Video Systems Lab., School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), 291 Daehak-ro, Yuseong-gu, Daejeon, 34141, Republic of Korea
Yong Man Ro
Professor of Electrical Engineering, KAIST, ICT Endowed Chair Professor
Multimodal learning · Vision-Language integration · Image processing and Computer vision