Brewing Stronger Features: Dual-Teacher Distillation for Multispectral Earth Observation

📅 2026-02-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of modeling cross-modal semantic consistency and capturing unified global representations in Earth observation pretraining. To this end, we propose a dual-teacher contrastive distillation framework that, for the first time, introduces contrastive self-distillation into multispectral remote sensing. Our approach leverages both a multispectral-specific model and a general-purpose vision foundation model as teachers to guide the student toward consistent semantic representations across optical and multispectral data. This is further enhanced by an improved masked image modeling strategy and a cross-modal alignment mechanism. Extensive experiments demonstrate consistent average improvements of 3.64, 1.20, and 1.31 percentage points over state-of-the-art methods in semantic segmentation, change detection, and classification, respectively, establishing new performance benchmarks in multimodal remote sensing scenarios.

📝 Abstract
Foundation models are transforming Earth Observation (EO), yet the diversity of EO sensors and modalities makes a single universal model unrealistic. Multiple specialized EO foundation models (EOFMs) will likely coexist, making efficient knowledge transfer across modalities essential. Most existing EO pretraining relies on masked image modeling, which emphasizes local reconstruction but provides limited control over global semantic structure. To address this, we propose a dual-teacher contrastive distillation framework for multispectral imagery that aligns the student's pretraining objective with the contrastive self-distillation paradigm of modern optical vision foundation models (VFMs). Our approach combines a multispectral teacher with an optical VFM teacher, enabling coherent cross-modal representation learning. Experiments across diverse optical and multispectral benchmarks show that our model adapts to multispectral data without compromising performance on optical-only inputs, achieving state-of-the-art results in both settings, with an average improvement of 3.64 percentage points in semantic segmentation, 1.2 in change detection, and 1.31 in classification tasks. This demonstrates that contrastive distillation provides a principled and efficient approach to scalable representation learning across heterogeneous EO data sources. Code: Coming soon.
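The core objective described in the abstract, a student pulled toward two teachers (a multispectral model and an optical VFM) at once, can be illustrated with a minimal sketch. This is not the paper's actual loss (its code is not yet released); the function name, the cosine-distance form, and the `alpha` weighting are assumptions for illustration only.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize embeddings to unit length along the feature axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def dual_teacher_distill_loss(student, teacher_ms, teacher_vfm, alpha=0.5):
    """Hypothetical dual-teacher distillation objective: average cosine
    distance from the student's embeddings to each teacher's embeddings,
    mixed by a weight `alpha` (all names/weights are illustrative)."""
    s = l2_normalize(student)
    t_ms = l2_normalize(teacher_ms)    # multispectral-specific teacher
    t_vfm = l2_normalize(teacher_vfm)  # general-purpose optical VFM teacher
    loss_ms = 1.0 - np.mean(np.sum(s * t_ms, axis=-1))
    loss_vfm = 1.0 - np.mean(np.sum(s * t_vfm, axis=-1))
    return alpha * loss_ms + (1.0 - alpha) * loss_vfm
```

When the student's embeddings match both teachers the loss is zero; disagreeing with either teacher raises it, so minimizing it encourages the cross-modal consistency the abstract targets.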
Problem

Research questions and friction points this paper is trying to address.

Earth Observation
multispectral imagery
foundation models
cross-modal representation
knowledge distillation
Innovation

Methods, ideas, or system contributions that make the work stand out.

dual-teacher distillation
contrastive distillation
multispectral Earth Observation
foundation models
cross-modal representation learning
Filip Wolf
University of Ljubljana, Faculty of Computer and Information Science, Slovenia
Blaž Rolih
University of Ljubljana, Faculty of Computer and Information Science, Slovenia
Luka Čehovin Zajc
Assistant Professor at the Faculty of Computer and Information Science, University of Ljubljana
Computer Vision, Machine Learning, Remote Sensing, HCI