A multi-scale vision transformer-based multimodal GeoAI model for mapping Arctic permafrost thaw

📅 2025-04-23
🤖 AI Summary
Thaw slumps—rapid land-surface failures triggered by Arctic permafrost degradation—are characterized by small spatial scales, ill-defined boundaries, and strong spatiotemporal dynamics, posing significant challenges for accurate remote sensing detection and mapping. To address these challenges, we propose a high-precision thaw slump detection framework leveraging multi-source remote sensing data (optical, SAR, and DEM). Our method introduces a residual cross-modal attention fusion mechanism to achieve complementary feature-level integration; adopts a “single-modal pretraining + multi-modal fine-tuning” paradigm to balance model performance and computational efficiency; and incorporates a multi-scale Vision Transformer (ViT) backbone into the Cascade Mask R-CNN architecture. Extensive experiments demonstrate that our approach substantially outperforms data-level fusion, CNN-based feature fusion, and recent attention-based fusion methods. It achieves state-of-the-art accuracy in pan-Arctic thaw slump mapping, providing a robust technical foundation for permafrost degradation monitoring and environmental assessment.

📝 Abstract
Retrogressive Thaw Slumps (RTS) in Arctic regions are distinct permafrost landforms with significant environmental impacts. Mapping these RTS is crucial because their appearance serves as a clear indication of permafrost thaw. However, their small scale compared to other landform features, vague boundaries, and spatiotemporal variation pose significant challenges for accurate detection. In this paper, we employed a state-of-the-art deep learning model, the Cascade Mask R-CNN with a multi-scale vision transformer-based backbone, to delineate RTS features across the Arctic. Two new strategies were introduced to optimize multimodal learning and enhance the model's predictive performance: (1) a feature-level, residual cross-modality attention fusion strategy, which effectively integrates feature maps from multiple modalities to capture complementary information and improve the model's ability to understand complex patterns and relationships within the data; (2) pre-trained unimodal learning followed by multimodal fine-tuning to alleviate high computing demand while achieving strong model performance. Experimental results demonstrated that our approach outperformed existing models adopting data-level fusion, feature-level convolutional fusion, and various attention fusion strategies, providing valuable insights into the efficient utilization of multimodal data for RTS mapping. This research contributes to our understanding of permafrost landforms and their environmental implications.
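The residual cross-modality attention fusion strategy described in the abstract can be sketched in a few lines. The snippet below is an illustrative NumPy simplification (single attention head, no learned projections); the function names and the choice of optical/SAR streams are assumptions for demonstration, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(query_feat, kv_feat):
    """Single-head cross-attention: tokens of one modality (queries)
    attend to tokens of another modality (keys/values).
    Shapes: query_feat (Nq, d), kv_feat (Nk, d)."""
    d = query_feat.shape[-1]
    scores = query_feat @ kv_feat.T / np.sqrt(d)   # (Nq, Nk)
    return softmax(scores) @ kv_feat               # (Nq, d)

def residual_fusion(optical, sar):
    """Residual cross-modality fusion (illustrative): the optical
    feature map is enriched with SAR context, while a skip connection
    preserves the original optical features."""
    return optical + cross_modal_attention(optical, sar)
```

Note the residual term: if the complementary modality carries no signal (all-zero features), the fused output degrades gracefully to the original unimodal features rather than corrupting them.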
Problem

Research questions and friction points this paper is trying to address.

Mapping small-scale Arctic permafrost thaw landforms (RTS)
Overcoming vague boundaries and spatiotemporal variation in RTS detection
Optimizing multimodal learning for accurate permafrost thaw mapping
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-scale vision transformer backbone for RTS detection
Residual cross-modality attention fusion strategy
Pre-trained unimodal learning with multimodal fine-tuning
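The staged training innovation above can be made concrete with a small scheduling sketch. This is a hypothetical outline assuming the paper's three modalities (optical, SAR, DEM); the `build_schedule` helper, stage names, and parameter labels are illustrative, not taken from the authors' code.

```python
def build_schedule(modalities, fusion_params):
    """Sketch of the 'pre-trained unimodal learning + multimodal
    fine-tuning' paradigm: each modality encoder is first trained on
    its own, then all encoders plus the fusion layers are fine-tuned
    jointly (hypothetical stage layout)."""
    stages = []
    # Stages 1..N: train one encoder per modality in isolation,
    # keeping per-stage memory and compute demands low.
    for m in modalities:
        stages.append({"stage": f"pretrain_{m}",
                       "trainable": [f"{m}_encoder"]})
    # Final stage: load the pretrained encoders and fine-tune them
    # together with the cross-modality fusion parameters.
    stages.append({"stage": "multimodal_finetune",
                   "trainable": [f"{m}_encoder" for m in modalities]
                                + fusion_params})
    return stages

schedule = build_schedule(["optical", "sar", "dem"], ["cross_attn_fusion"])
```

The design choice this illustrates: only the final stage pays the full multimodal training cost, which is how the paradigm balances predictive performance against computational demand.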
Authors
Wenwen Li, School of Geographical Sciences and Urban Planning, Arizona State University
Chia-Yu Hsu, School of Geographical Sciences and Urban Planning, Arizona State University
Sizhe Wang, Washington University in Saint Louis
Zhining Gu, Arizona State University
Yili Yang, Woodwell Climate Research Center
Brendan M. Rogers, Woodwell Climate Research Center
A. Liljedahl, Woodwell Climate Research Center