AI Summary
To address the domain shift between synthetic aperture radar (SAR) remote sensing imagery and ImageNet natural images, this work proposes two self-supervised multimodal pretraining strategies and a hybrid architecture that pairs a Swin Transformer encoder with a residual CNN decoder for glacier calving front localization. Pretraining on the unlabeled SSL4SAR dataset replaces ImageNet-supervised pretraining and supports year-round SAR-based monitoring. On the CaFFe benchmark, the single model achieves a mean distance error of 293 m, improving on the prior best model by 67 m. An ensemble of the model, evaluated on the benchmark's multi-annotator study, reaches a mean distance error of 75 m, approaching the human performance of 38 m. By combining self-supervised multimodal pretraining with a Swin-CNN hybrid architecture, the approach mitigates the domain shift that limits ImageNet-pretrained models for glacier front extraction.
Abstract
Glaciers are losing ice mass at unprecedented rates, increasing the need for accurate, year-round monitoring to understand frontal ablation, particularly the factors driving the calving process. Deep learning models can extract calving front positions from Synthetic Aperture Radar imagery to track seasonal ice losses at the calving fronts of marine- and lake-terminating glaciers. The current state-of-the-art model relies on ImageNet-pretrained weights. However, these weights are suboptimal because of the domain shift between the natural images in ImageNet and the specialized characteristics of remote sensing imagery, in particular Synthetic Aperture Radar imagery. To address this challenge, we propose two novel self-supervised multimodal pretraining techniques that leverage SSL4SAR, a new unlabeled dataset comprising 9,563 Sentinel-1 and 14 Sentinel-2 images of Arctic glaciers, with one optical image per glacier in the dataset. Additionally, we introduce a novel hybrid model architecture that combines a Swin Transformer encoder with a residual Convolutional Neural Network (CNN) decoder. When pretrained on SSL4SAR, this model achieves a mean distance error of 293 m on the "CAlving Fronts and where to Find thEm" (CaFFe) benchmark dataset, outperforming the prior best model by 67 m. Evaluating an ensemble of the proposed model on a multi-annotator study of the benchmark dataset reveals a mean distance error of 75 m, approaching the human performance of 38 m. This advancement enables precise monitoring of seasonal changes in glacier calving fronts.
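The abstract reports results as a mean distance error in metres but does not define the metric. The sketch below shows one plausible reading, offered as an illustration only: the symmetric average of nearest-neighbour distances between predicted and reference front pixels, scaled by the ground resolution. The function name `mean_distance_error`, the `pixel_size_m` parameter, and the toy coordinates are assumptions, not taken from the paper or the CaFFe benchmark code.

```python
import math


def mean_distance_error(pred_front, true_front, pixel_size_m=1.0):
    """Symmetric mean nearest-neighbour distance between two fronts, in metres.

    pred_front, true_front: lists of (row, col) pixel coordinates.
    pixel_size_m: hypothetical ground resolution of one pixel, in metres.
    """
    if not pred_front or not true_front:
        raise ValueError("both fronts need at least one pixel")

    def nearest(point, others):
        # Euclidean distance from one pixel to the closest pixel of the other front.
        return min(math.dist(point, other) for other in others)

    pred_to_true = [nearest(p, true_front) for p in pred_front]
    true_to_pred = [nearest(q, pred_front) for q in true_front]
    total = sum(pred_to_true) + sum(true_to_pred)
    return pixel_size_m * total / (len(pred_front) + len(true_front))


# Toy example: a straight reference front and a prediction shifted by 3 pixels.
true_front = [(r, 5) for r in range(4)]
pred_front = [(r, 8) for r in range(4)]
error_m = mean_distance_error(pred_front, true_front, pixel_size_m=10.0)
print(error_m)  # 30.0
```

With a 10 m pixel, a uniform 3-pixel offset gives an error of 30 m. The reported figures can be read the same way: an error of 293 m that improves on the prior best by 67 m implies a previous error of 360 m.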