SEMT: Static-Expansion-Mesh Transformer Network Architecture for Remote Sensing Image Captioning

📅 2025-07-17
🤖 AI Summary
Remote sensing image captioning (RSIC) aims to automatically generate accurate, semantically rich natural-language descriptions of satellite imagery, supporting applications such as environmental monitoring and disaster assessment. To model the long-range spatial dependencies and fine-grained semantics of remote sensing images, this paper proposes a transformer-based architecture that combines a Static Expansion strategy with Memory-Augmented Self-Attention inside a Mesh Transformer framework. The design is intended to capture both global spatial structure and local semantic detail. Experiments on the UCM-Caption and NWPU-Caption benchmarks show that the best model outperforms state-of-the-art systems on most evaluation metrics, which include BLEU, CIDEr, and SPICE, indicating its potential for practical remote sensing applications.

📝 Abstract
Image captioning has emerged as a crucial task at the intersection of computer vision and natural language processing, enabling the automated generation of descriptive text from visual content. In remote sensing, image captioning plays a significant role in interpreting vast and complex satellite imagery, aiding applications such as environmental monitoring, disaster assessment, and urban planning. This motivates us to present a transformer-based network architecture for remote sensing image captioning (RSIC) in which three techniques, Static Expansion, Memory-Augmented Self-Attention, and Mesh Transformer, are evaluated and integrated. We evaluate our proposed models on two benchmark remote sensing image datasets, UCM-Caption and NWPU-Caption. Our best model outperforms the state-of-the-art systems on most evaluation metrics, which demonstrates its potential for real-life remote sensing image systems.
Problem

Research questions and friction points this paper is trying to address.

Develop a transformer network for remote sensing image captioning
Integrate the Static Expansion and Mesh Transformer techniques
Improve accuracy in satellite image description tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Static Expansion technique for RSIC
Memory-Augmented Self-Attention mechanism
Mesh Transformer network architecture
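
As a rough illustration of the second item, the sketch below shows one common way memory-augmented self-attention is formulated: extra memory slots are appended to the keys and values so each image region can attend to stored prior knowledge as well as to other regions. This is a minimal single-head NumPy sketch under stated assumptions, not the authors' implementation; the function name, the tensor sizes, and the random (rather than learned) memory vectors are all hypothetical.

```python
import numpy as np


def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=axis, keepdims=True)


def memory_augmented_attention(x, w_q, w_k, w_v, mem_k, mem_v):
    """Single-head self-attention with extra memory slots (illustrative only).

    x:        (n, d) image-region features
    w_q/k/v:  (d, d) projection matrices
    mem_k:    (m, d) memory keys (learned in a real model; fixed here)
    mem_v:    (m, d) memory values
    Returns the (n, d) outputs and the (n, n + m) attention weights.
    """
    q = x @ w_q
    # Append the memory slots to the projected keys and values.
    k = np.concatenate([x @ w_k, mem_k], axis=0)  # (n + m, d)
    v = np.concatenate([x @ w_v, mem_v], axis=0)  # (n + m, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])       # scaled dot product
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


# Hypothetical sizes: 6 image regions, 8-dim features, 4 memory slots.
rng = np.random.default_rng(0)
n_regions, d_model, n_slots = 6, 8, 4
x = rng.normal(size=(n_regions, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
mem_k = rng.normal(size=(n_slots, d_model))
mem_v = rng.normal(size=(n_slots, d_model))

out, attn = memory_augmented_attention(x, w_q, w_k, w_v, mem_k, mem_v)
print(out.shape, attn.shape)  # (6, 8) (6, 10)
```

Each row of `attn` spans the 6 regions plus the 4 memory slots, so the output keeps the input shape while drawing on information that is not present in the image features themselves.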
Khang Truong
Ho Chi Minh University of Technology, Vietnam
Lam Pham
Data Scientist at the Austrian Institute of Technology
VLSI Design, Signal Processing, Deep Learning, Multimodal
Hieu Tang
University of Technology of Troyes, France
Jasmin Lampert
Senior Scientist at AIT Austrian Institute of Technology
Green Data Science, Geospatial Analytics, Physics-informed Machine Learning, Environmental Monitoring, Crisis Management
Martin Boyer
Austrian Institute of Technology, Vienna, Austria
Son Phan
Ton Duc Thang University, Vietnam
Truong Nguyen
Ho Chi Minh University of Technology, Vietnam