🤖 AI Summary
This work investigates whether generative diffusion models can serve as strongly discriminative geospatial foundation models (GFMs) for remote sensing tasks. To this end, we propose SatDiFuser, a framework that analyzes multi-stage, noise-dependent diffusion features and bridges generative pretraining and discriminative downstream tasks through three fusion strategies. The approach repurposes a diffusion-based generative GFM as a representation learner for discriminative tasks. We evaluate it on multiple remote sensing benchmarks for semantic segmentation and image classification. Results show gains of up to +5.7% mIoU in semantic segmentation and up to +7.9% F1-score in image classification over state-of-the-art GFMs. The study provides evidence that diffusion-based pretraining, when properly adapted, yields discriminative representations that rival or exceed those of discriminative GFMs on geospatial vision tasks.
📝 Abstract
Self-supervised learning (SSL) has revolutionized representation learning in Remote Sensing (RS), advancing Geospatial Foundation Models (GFMs) to leverage vast unlabeled satellite imagery for diverse downstream tasks. Currently, GFMs primarily focus on discriminative objectives, such as contrastive learning or masked image modeling, owing to their proven success in learning transferable representations. However, generative diffusion models, which demonstrate the potential to capture multi-grained semantics essential for RS tasks during image generation, remain underexplored for discriminative applications. This prompts the question: can generative diffusion models also excel and serve as GFMs with sufficient discriminative power? In this work, we answer this question with SatDiFuser, a framework that transforms a diffusion-based generative geospatial foundation model into a powerful pretraining tool for discriminative RS. By systematically analyzing multi-stage, noise-dependent diffusion features, we develop three fusion strategies to effectively leverage these diverse representations. Extensive experiments on remote sensing benchmarks show that SatDiFuser outperforms state-of-the-art GFMs, achieving gains of up to +5.7% mIoU in semantic segmentation and +7.9% F1-score in classification, demonstrating the capacity of diffusion-based generative foundation models to rival or exceed discriminative GFMs. Code will be released.
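The abstract does not specify the three fusion strategies, so the following is only an illustrative sketch of the general idea of combining features taken at several noise levels (timesteps) and network stages. It assumes each feature has already been pooled to a vector of the same length, and it uses a plain softmax-weighted sum. The names `fuse_features`, `features`, and `scores` are hypothetical and are not taken from SatDiFuser.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_features(features, scores):
    """Softmax-weighted sum of equally sized feature vectors.

    features: dict mapping (timestep, stage) -> list of floats
              (hypothetical pooled diffusion features)
    scores:   dict with the same keys -> float
              (hypothetical logits; learnable in a real model)
    Returns the fused vector and the weight assigned to each key.
    """
    keys = sorted(features)
    weights = softmax([scores[k] for k in keys])
    dim = len(features[keys[0]])
    fused = [0.0] * dim
    for w, k in zip(weights, keys):
        if len(features[k]) != dim:
            raise ValueError("all feature vectors must share one dimension")
        for i, v in enumerate(features[k]):
            fused[i] += w * v
    return fused, dict(zip(keys, weights))


# Toy example: two noise levels (timesteps) x two network stages.
features = {
    (100, 0): [1.0, 0.0],
    (100, 1): [0.0, 1.0],
    (500, 0): [2.0, 2.0],
    (500, 1): [4.0, 0.0],
}
scores = {k: 0.0 for k in features}  # equal logits give uniform weights
fused, weights = fuse_features(features, scores)
```

With equal logits every (timestep, stage) pair contributes equally. In a trained model the logits would presumably be learned per downstream task, which is one simple way a fusion module could favor particular noise levels or stages.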