🤖 AI Summary
Remote sensing image captioning (RSIC) aims to automatically generate accurate, semantically rich natural language descriptions of satellite imagery, supporting applications such as environmental monitoring and disaster assessment. To address the challenge of modeling the long-range spatial dependencies and fine-grained semantics inherent in remote sensing images, this paper proposes a Mesh Transformer architecture that combines a static expansion strategy with a memory-augmented self-attention mechanism inside a hierarchical mesh decoding framework. This design enhances the model's capacity to capture both global spatial structure and local semantic detail. Experiments on the UCM-Caption and NWPU-Caption benchmarks show that the proposed method outperforms state-of-the-art approaches on most major evaluation metrics, including BLEU, CIDEr, and SPICE, validating its effectiveness and practical applicability.
📝 Abstract
Image captioning has emerged as a crucial task at the intersection of computer vision and natural language processing, enabling the automated generation of descriptive text from visual content. In the context of remote sensing, image captioning plays a significant role in interpreting vast and complex satellite imagery, aiding applications such as environmental monitoring, disaster assessment, and urban planning. This motivates us to present a transformer-based network architecture for remote sensing image captioning (RSIC) in which multiple techniques, namely Static Expansion, Memory-Augmented Self-Attention, and the Mesh Transformer, are evaluated and integrated. We evaluate our proposed models on two benchmark remote sensing image datasets, UCM-Caption and NWPU-Caption. Our best model outperforms state-of-the-art systems on most evaluation metrics, demonstrating its potential for real-life remote sensing image systems.
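To make the memory-augmented self-attention component concrete, the sketch below shows the core idea in NumPy: learnable memory slots are appended to the keys and values of ordinary scaled dot-product self-attention, letting the decoder attend to persistent slots beyond the input tokens. This is a minimal single-head illustration under assumed shapes, not the paper's implementation; all names and dimensions here are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def memory_augmented_attention(x, w_q, w_k, w_v, mem_k, mem_v):
    """Single-head self-attention with memory slots appended to keys/values.

    x:            (n, d) input token features
    mem_k, mem_v: (m, d) memory slots, learned during training
    """
    q = x @ w_q                       # (n, d) queries from input tokens only
    k = np.vstack([x @ w_k, mem_k])   # (n + m, d) keys extended by memory
    v = np.vstack([x @ w_v, mem_v])   # (n + m, d) values extended by memory
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v   # (n, d) attended output

rng = np.random.default_rng(0)
d, n, m = 8, 4, 2                     # feature dim, tokens, memory slots
x = rng.normal(size=(n, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
mem_k, mem_v = rng.normal(size=(m, d)), rng.normal(size=(m, d))
out = memory_augmented_attention(x, w_q, w_k, w_v, mem_k, mem_v)
print(out.shape)
```

The output keeps the input token count and feature dimension; only the attention span grows by the number of memory slots.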