AeroLite: Tag-Guided Lightweight Generation of Aerial Image Captions

📅 2025-04-13
🤖 AI Summary
Aerial image captioning suffers from limited accuracy and weak cross-domain generalization, both caused by complex spatial semantics and large domain discrepancies. To address this, the paper proposes AeroLite, a lightweight, tag-guided, end-to-end captioning framework. Methodologically, it (1) introduces what it describes as the first multi-label CLIP encoder for fine-grained remote sensing semantic modeling; (2) designs a bridging MLP module to align visual and semantic representations; and (3) proposes a two-stage LoRA fine-tuning paradigm that combines broad semantic learning with domain-specific adaptation. Training data are constructed via GPT-4o pseudo-captioning and NLP-based tag extraction, and the framework integrates compact language models (1–3B parameters). Experiments report better performance than 13B-parameter baselines on mainstream captioning metrics, with a more than 80% reduction in inference cost, along with improved interpretability, deployment efficiency, and cross-domain generalization.
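For reference, the two-stage LoRA scheme builds on the standard low-rank adaptation update, sketched below. The symbols $W_0$, $A$, $B$, $r$, and $\alpha$ follow the usual LoRA formulation and are not taken from this page; the rank, scaling, and adapted layers AeroLite actually uses are not stated here.

```latex
h = W_0 x + \frac{\alpha}{r} B A x,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
```

Only $A$ and $B$ are trained while $W_0$ stays frozen, which is what keeps the fine-tuning lightweight. Under this reading, the first stage would fit the adapters for broad semantic learning and the second would continue for domain-specific adaptation; whether the two stages share one adapter or use separate ones is not specified in this summary.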

📝 Abstract
Accurate and automated captioning of aerial imagery is crucial for applications like environmental monitoring, urban planning, and disaster management. However, this task remains challenging due to complex spatial semantics and domain variability. To address these issues, we introduce AeroLite, a lightweight, tag-guided captioning framework designed to equip small-scale language models (1–3B parameters) with robust and interpretable captioning capabilities specifically for remote sensing images. AeroLite leverages GPT-4o to generate a large-scale, semantically rich pseudo-caption dataset by integrating multiple remote sensing benchmarks, including DLRSD, iSAID, LoveDA, WHU, and RSSCN7. To explicitly capture key semantic elements such as orientation and land-use types, AeroLite employs natural language processing techniques to extract relevant semantic tags. These tags are then learned by a dedicated multi-label CLIP encoder, ensuring precise semantic predictions. To effectively fuse visual and semantic information, we propose a novel bridging multilayer perceptron (MLP) architecture, aligning semantic tags with visual embeddings while maintaining minimal computational overhead. AeroLite's flexible design also enables seamless integration with various pretrained large language models. We adopt a two-stage LoRA-based training approach: the initial stage leverages our pseudo-caption dataset to capture broad remote sensing semantics, followed by fine-tuning on smaller, curated datasets like UCM and Sydney Captions to refine domain-specific alignment. Experimental evaluations demonstrate that AeroLite surpasses significantly larger models (e.g., 13B parameters) in standard captioning metrics, including BLEU and METEOR, while maintaining substantially lower computational costs.
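The tag-extraction and multi-label steps described in the abstract can be illustrated with a minimal sketch. It assumes a small keyword vocabulary and simple word matching; `TAG_VOCAB`, `extract_tags`, `multi_hot`, and `bce_with_logits` are hypothetical names, and the abstract does not say which NLP techniques or loss AeroLite actually uses. Binary cross-entropy over independent per-tag outputs is shown only because it is a common choice for multi-label targets.

```python
import math
import re

# Hypothetical tag vocabulary; the paper's actual tag set is not listed here.
TAG_VOCAB = ["building", "road", "river", "forest", "farmland", "north", "south"]

def extract_tags(caption, vocab=TAG_VOCAB):
    """Return vocabulary tags found in the caption (matches singular or plural 's')."""
    words = set(re.findall(r"[a-z]+", caption.lower()))
    return [t for t in vocab if t in words or t + "s" in words]

def multi_hot(tags, vocab=TAG_VOCAB):
    """Encode a tag list as a multi-hot target vector over the vocabulary."""
    present = set(tags)
    return [1.0 if t in present else 0.0 for t in vocab]

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over independent per-tag sigmoid outputs."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form: max(z, 0) - z*y + log(1 + exp(-|z|))
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

caption = "Several buildings line a road that runs north of a small river."
tags = extract_tags(caption)
target = multi_hot(tags)
```

In this sketch each tag is an independent yes/no prediction, so one image can carry several tags at once (for example both a land-use type and an orientation), which is the property a multi-label encoder needs.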
Problem

Research questions and friction points this paper is trying to address.

Automated captioning of aerial images is challenging due to complex spatial semantics.
Existing models struggle with domain variability in remote sensing imagery.
Small-scale language models lack robust captioning capabilities for aerial images.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight tag-guided captioning framework
Large-scale pseudo-caption dataset generated with GPT-4o
Novel bridging MLP that aligns semantic tags with visual embeddings
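The bridging MLP idea can be sketched in a few lines of plain Python. Everything concrete here is an assumption made for illustration: the class name `BridgeMLP`, the two-layer shape, the GELU activation, the fusion by concatenation, and all dimensions. The page states only that the module aligns semantic tags with visual embeddings at low computational cost.

```python
import math
import random

def linear(x, weight, bias):
    """Compute weight @ x + bias for a single input vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def gelu(x):
    """Tanh approximation of GELU, applied element-wise."""
    c = math.sqrt(2.0 / math.pi)
    return [0.5 * v * (1.0 + math.tanh(c * (v + 0.044715 * v ** 3))) for v in x]

def init_linear(rng, out_dim, in_dim):
    """Small uniform random weights and zero bias."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim

class BridgeMLP:
    """Hypothetical two-layer bridge from (visual, tag) features to the LM embedding size."""

    def __init__(self, visual_dim, tag_dim, hidden_dim, lm_dim, seed=0):
        rng = random.Random(seed)
        self.w1, self.b1 = init_linear(rng, hidden_dim, visual_dim + tag_dim)
        self.w2, self.b2 = init_linear(rng, lm_dim, hidden_dim)

    def __call__(self, visual_emb, tag_probs):
        # Concatenation is an assumed fusion choice, not confirmed by the source.
        fused = list(visual_emb) + list(tag_probs)
        hidden = gelu(linear(fused, self.w1, self.b1))
        return linear(hidden, self.w2, self.b2)

bridge = BridgeMLP(visual_dim=8, tag_dim=7, hidden_dim=16, lm_dim=12)
visual_emb = [0.1 * i for i in range(8)]
tag_probs = [0.9, 0.8, 0.7, 0.1, 0.0, 0.6, 0.1]
prefix = bridge(visual_emb, tag_probs)  # vector in the (toy) language-model embedding space
```

The output has the language model's embedding width, so it could be fed to the model as a conditioning vector. A module of this size adds few parameters compared with a 1–3B language model, which is consistent with the lightweight goal.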
Authors

Xing Zi
Researcher, University of Technology Sydney
Computer Vision, Remote Sensing, Multimodal
Tengjun Ni
University of New South Wales, Sydney, Australia
Xianjing Fan
University of Technology Sydney, Sydney, Australia
Xian Tao
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Jun Li
University of Technology Sydney, Sydney, Australia
Ali Braytee
University of Technology Sydney
Machine learning, optimization, data mining, computational biology
Mukesh Prasad
University of Technology Sydney, Sydney, Australia