🤖 AI Summary
Remote sensing image captioning (RSIC) aims to automatically generate accurate, semantically rich natural language descriptions of satellite imagery, supporting applications such as environmental monitoring and disaster assessment. To address the challenge of modeling the long-range spatial dependencies and fine-grained semantics inherent in remote sensing images, this paper proposes a Mesh Transformer architecture that combines a static expansion strategy with a memory-augmented self-attention mechanism inside a hierarchical mesh decoding framework. This design enhances the model's capacity to capture both global spatial structure and local semantic detail. Experiments on the UCM-Caption and NWPU-Caption benchmarks show that the proposed method outperforms state-of-the-art approaches on most major evaluation metrics, including BLEU, CIDEr, and SPICE, validating its effectiveness and practical applicability.
📝 Abstract
Image captioning has emerged as a crucial task at the intersection of computer vision and natural language processing, enabling the automated generation of descriptive text from visual content. In the context of remote sensing, image captioning plays a significant role in interpreting vast and complex satellite imagery, aiding applications such as environmental monitoring, disaster assessment, and urban planning. This motivates us to present a transformer-based network architecture for remote sensing image captioning (RSIC) in which multiple techniques, namely Static Expansion, Memory-Augmented Self-Attention, and the Mesh Transformer, are evaluated and integrated. We evaluate our proposed models on two benchmark remote sensing image datasets, UCM-Caption and NWPU-Caption. Our best model outperforms state-of-the-art systems on most evaluation metrics, demonstrating its potential for real-life remote sensing image systems.
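To make the memory-augmented self-attention component concrete, the sketch below shows the core idea in NumPy: learnable memory slots are appended to the keys and values of ordinary scaled dot-product self-attention, letting the decoder attend to persistent slots beyond the input tokens. This is a minimal single-head illustration under assumed shapes, not the paper's implementation; all names and dimensions here are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def memory_augmented_attention(x, w_q, w_k, w_v, mem_k, mem_v):
    """Single-head self-attention with memory slots appended to keys/values.

    x:            (n, d) input token features
    mem_k, mem_v: (m, d) memory slots, learned during training
    """
    q = x @ w_q                       # (n, d) queries from input tokens only
    k = np.vstack([x @ w_k, mem_k])   # (n + m, d) keys extended by memory
    v = np.vstack([x @ w_v, mem_v])   # (n + m, d) values extended by memory
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v   # (n, d) attended output

rng = np.random.default_rng(0)
d, n, m = 8, 4, 2                     # feature dim, tokens, memory slots
x = rng.normal(size=(n, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
mem_k, mem_v = rng.normal(size=(m, d)), rng.normal(size=(m, d))
out = memory_augmented_attention(x, w_q, w_k, w_v, mem_k, mem_v)
print(out.shape)
```

The output keeps the input token count and feature dimension; only the attention span grows by the number of memory slots.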