🤖 AI Summary
Thaw slumps, rapid land-surface failures triggered by Arctic permafrost degradation, are characterized by small spatial scales, ill-defined boundaries, and strong spatiotemporal dynamics, posing significant challenges for accurate remote sensing detection and mapping. To address these challenges, we propose a high-precision thaw slump detection framework that leverages multi-source remote sensing data (optical, SAR, and DEM). Our method introduces a residual cross-modal attention fusion mechanism for complementary feature-level integration; adopts a "single-modal pretraining + multi-modal fine-tuning" paradigm to balance model performance against computational cost; and incorporates a multi-scale Vision Transformer (ViT) backbone into the Cascade Mask R-CNN architecture. Extensive experiments demonstrate that our approach substantially outperforms data-level fusion, CNN-based feature fusion, and state-of-the-art attention-based fusion methods, achieving the best accuracy in pan-Arctic thaw slump mapping and providing a robust technical foundation for permafrost degradation monitoring and environmental assessment.
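The residual cross-modal attention fusion described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the projection matrices are random stand-ins for learned weights, the token counts and feature dimension are arbitrary, and the choice of optical features as queries with SAR features as keys/values is an assumption for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def residual_cross_modal_fusion(query_feats, context_feats, d_k=16, seed=0):
    """Fuse modality-B features into modality A via cross-attention + residual.

    query_feats:   (N_a, d) tokens from modality A (e.g. optical)
    context_feats: (N_b, d) tokens from modality B (e.g. SAR or DEM)
    """
    rng = np.random.default_rng(seed)
    d = query_feats.shape[-1]
    # Random stand-ins for learned projection weights (illustration only).
    w_q = rng.normal(size=(d, d_k)) / np.sqrt(d)
    w_k = rng.normal(size=(d, d_k)) / np.sqrt(d)
    w_v = rng.normal(size=(d, d)) / np.sqrt(d)
    q = query_feats @ w_q                     # queries from modality A
    k = context_feats @ w_k                   # keys from modality B
    v = context_feats @ w_v                   # values from modality B
    attn = softmax(q @ k.T / np.sqrt(d_k))    # (N_a, N_b) attention weights
    fused = attn @ v                          # modality-B signal per A-token
    # Residual connection: the query modality's features are preserved and
    # only augmented by the cross-modal information.
    return query_feats + fused

# Toy feature maps: 6 optical tokens and 4 SAR tokens, 32-dim each.
optical = np.random.default_rng(1).normal(size=(6, 32))
sar = np.random.default_rng(2).normal(size=(4, 32))
out = residual_cross_modal_fusion(optical, sar)
print(out.shape)  # (6, 32): output keeps the query modality's shape
```

The residual add is the key design choice: if the attention output carries little useful cross-modal signal, the query modality's original features still pass through unchanged.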
📝 Abstract
Retrogressive Thaw Slumps (RTS) in Arctic regions are distinctive permafrost landforms with significant environmental impacts. Mapping RTS is crucial because their appearance is a clear indicator of permafrost thaw. However, their small scale relative to other landform features, vague boundaries, and spatiotemporal variability pose significant challenges for accurate detection. In this paper, we employ a state-of-the-art deep learning model, Cascade Mask R-CNN with a multi-scale vision-transformer-based backbone, to delineate RTS features across the Arctic. We introduce two strategies to optimize multimodal learning and enhance the model's predictive performance: (1) a feature-level, residual cross-modality attention fusion strategy, which integrates feature maps from multiple modalities to capture complementary information and improve the model's ability to learn complex patterns and relationships within the data; and (2) unimodal pretraining followed by multimodal fine-tuning, which reduces computational demand while maintaining strong model performance. Experimental results show that our approach outperforms existing models that adopt data-level fusion, feature-level convolutional fusion, and various attention fusion strategies, offering valuable insights into the efficient use of multimodal data for RTS mapping. This research contributes to our understanding of permafrost landforms and their environmental implications.
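The "unimodal pretraining followed by multimodal fine-tuning" idea can be illustrated with a toy sketch. Everything here is a simplified assumption: the real backbones would be ViT encoders trained on actual imagery, and fusion in the paper is attention-based rather than the feature concatenation used below; the point is only the weight-transfer pattern, where encoders trained per modality are reused to initialize the multimodal model so that fine-tuning starts from strong features instead of training everything jointly from scratch.

```python
import numpy as np

class UnimodalEncoder:
    """Stand-in for a single-modality backbone (e.g. a ViT on optical imagery)."""
    def __init__(self, in_dim, out_dim, seed=0):
        rng = np.random.default_rng(seed)
        # In practice these weights come from pretraining on one modality.
        self.w = rng.normal(size=(in_dim, out_dim)) * 0.1

    def forward(self, x):
        return np.maximum(x @ self.w, 0.0)  # linear layer + ReLU

class MultimodalModel:
    """Fusion model that reuses pretrained unimodal encoders (concat fusion
    is a simplification; the paper's fusion is attention-based)."""
    def __init__(self, encoders, feat_dim, n_classes, seed=1):
        rng = np.random.default_rng(seed)
        self.encoders = encoders  # weights carried over from stage 1
        self.head = rng.normal(size=(feat_dim * len(encoders), n_classes)) * 0.1

    def forward(self, inputs):
        feats = [enc.forward(x) for enc, x in zip(self.encoders, inputs)]
        return np.concatenate(feats, axis=-1) @ self.head

# Stage 1: "pretrain" one encoder per modality (weights stand in for trained ones).
opt_enc = UnimodalEncoder(4, 8, seed=0)   # optical branch
sar_enc = UnimodalEncoder(4, 8, seed=1)   # SAR branch

# Stage 2: assemble the multimodal model from the pretrained encoders;
# fine-tuning would then update the fusion head (and optionally the encoders).
model = MultimodalModel([opt_enc, sar_enc], feat_dim=8, n_classes=2)
logits = model.forward([np.ones((3, 4)), np.ones((3, 4))])
print(logits.shape)  # (3, 2): one score per class for each of 3 samples
```

Because each encoder is pretrained on its own modality, the expensive joint multimodal training is reduced to a shorter fine-tuning stage, which is the computational saving the abstract refers to.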