Multimodal Feature Fusion Network with Text Difference Enhancement for Remote Sensing Change Detection

📅 2025-09-04
🤖 AI Summary
Existing single-modal remote sensing change detection (RSCD) methods suffer from limited feature representation, coarse-grained change pattern modeling, and poor robustness to illumination variations and noise. To address these bottlenecks, this paper proposes MMChange, a multimodal change detection framework that integrates the textual modality into RSCD. Specifically, it uses a vision-language model to generate semantic descriptions of the bitemporal images, introduces a text-difference enhancement module for fine-grained semantic change characterization, and establishes an image-text cross-modal fusion mechanism to enable complementary representation learning. The framework comprises four core components: image feature refinement, text semantic modeling, difference enhancement, and multimodal fusion. Experiments on three benchmark datasets (LEVIR-CD, WHU-CD, and SYSU-CD) show that MMChange consistently outperforms state-of-the-art methods, with improvements in mF1 and IoU scores. These results support the gains in detection accuracy and environmental robustness attributed to combining the two modalities.
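
For readers unfamiliar with the metrics named above, the following is a minimal sketch of how F1 and IoU are commonly computed for the "changed" class of a binary change map. The function name `change_metrics` and the toy masks are illustrative assumptions, not taken from the paper, and the sketch does not reproduce however the paper averages scores into mF1.

```python
def change_metrics(pred, truth):
    """Return (precision, recall, f1, iou) for the 'changed' class.

    pred and truth are equally sized 2D lists of 0/1 labels,
    where 1 marks a changed pixel. Hypothetical helper for illustration.
    """
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p == 1 and t == 1:
                tp += 1          # changed pixel correctly detected
            elif p == 1 and t == 0:
                fp += 1          # false alarm
            elif p == 0 and t == 1:
                fn += 1          # missed change
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f1, iou


# Toy 2x3 change maps (made-up values, not data from the paper).
pred = [[1, 1, 0],
        [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]
precision, recall, f1, iou = change_metrics(pred, truth)
```

Here there are two true positives, one false alarm, and one missed change, so IoU is 2 / 4 = 0.5, while precision, recall, and F1 all equal 2 / 3.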

📝 Abstract
Although deep learning has advanced remote sensing change detection (RSCD), most methods rely solely on the image modality, which limits feature representation, change pattern modeling, and generalization, especially under illumination and noise disturbances. To address this, we propose MMChange, a multimodal RSCD method that combines image and text modalities to enhance accuracy and robustness. An Image Feature Refinement (IFR) module is introduced to highlight key regions and suppress environmental noise. To overcome the semantic limitations of image features, we employ a vision-language model (VLM) to generate semantic descriptions of bitemporal images. A Textual Difference Enhancement (TDE) module then captures fine-grained semantic shifts, guiding the model toward meaningful changes. To bridge the heterogeneity between modalities, we design an Image-Text Feature Fusion (ITFF) module that enables deep cross-modal integration. Extensive experiments on LEVIR-CD, WHU-CD, and SYSU-CD demonstrate that MMChange consistently surpasses state-of-the-art methods across multiple metrics, validating its effectiveness for multimodal RSCD. Code is available at: https://github.com/yikuizhai/MMChange.
Problem

Research questions and friction points this paper is trying to address.

Enhances change detection accuracy using multimodal image-text fusion
Overcomes semantic limitations of image-only methods in remote sensing
Improves robustness against illumination and noise disturbances
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal fusion of image and text
Textual Difference Enhancement captures semantic shifts
Image-Text Feature Fusion bridges modality heterogeneity
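
The two ideas listed above can be illustrated with a toy sketch. This is not the authors' TDE or ITFF implementation, whose internals are not described on this page. It assumes, purely for illustration, that the two image descriptions are already encoded as token-aligned embedding lists, that the "difference" is a plain element-wise subtraction, and that fusion is a single-head dot-product cross-attention with a residual connection. All function names are hypothetical.

```python
import math


def text_difference(text_t1, text_t2):
    """Element-wise difference between two token-aligned embedding lists.

    Assumption: both descriptions have the same number of tokens and the
    same embedding size. A real module would likely be learned.
    """
    return [[b - a for a, b in zip(tok1, tok2)]
            for tok1, tok2 in zip(text_t1, text_t2)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_fuse(image_tokens, text_tokens):
    """Each image token attends over the text tokens (scaled dot product);
    the attended text context is added back to the image token."""
    dim = len(image_tokens[0])
    fused = []
    for query in image_tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in text_tokens]
        weights = softmax(scores)
        context = [sum(w * key[d] for w, key in zip(weights, text_tokens))
                   for d in range(dim)]
        fused.append([q + c for q, c in zip(query, context)])
    return fused


# Made-up 2-dimensional embeddings for two tokens per description.
text_t1 = [[1.0, 0.0], [0.0, 1.0]]
text_t2 = [[1.0, 0.0], [0.0, 3.0]]
diff = text_difference(text_t1, text_t2)

image_tokens = [[0.0, 0.0]]
fused = cross_attention_fuse(image_tokens, diff)
```

In this toy case only the second text token changes between the two dates, so `diff` is zero everywhere except that token, and the fused image token is shifted toward it. The released code at the repository linked in the abstract is the authoritative reference for the actual modules.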
Yijun Zhou
PhD student, The University of Tokyo
Human-Computer Interaction
Yikui Zhai
College of Electronics and Information Engineering, Wuyi University, Jiangmen, 529020, China
Zilu Ying
College of Electronics and Information Engineering, Wuyi University, Jiangmen, 529020, China
Tingfeng Xian
College of Electronics and Information Engineering, Wuyi University, Jiangmen, 529020, China
Wenlve Zhou
South China University of Technology
Artificial Intelligence, Computer Vision
Zhiheng Zhou
Center for Mind and Brain, University of California, Davis
Xiaolin Tian
State Key Laboratory of Lunar and Planetary Sciences, Macau University of Science and Technology, Taipa, Macau
Xudong Jia
College of Engineering and Computer Science, California State University, Northridge, 18111, America
Hongsheng Zhang
Associate Professor, The University of Hong Kong
Geography, GIS & Remote Sensing, Coastal Sustainability, Urban, Mangrove
C. L. Philip Chen
Faculty of Computer Science and Engineering, South China University of Technology, Guangzhou, Guangdong 510006, China