UniDiff: Parameter-Efficient Adaptation of Diffusion Models for Land Cover Classification with Multi-Modal Remotely Sensed Imagery and Sparse Annotations

📅 2025-11-28
📈 Citations: 0 · Influential: 0
🤖 AI Summary
To address the critical bottleneck of extreme label scarcity in semantic segmentation of multimodal remote sensing imagery, such as hyperspectral imaging (HSI) and synthetic aperture radar (SAR), this paper proposes UniDiff: a diffusion-based framework that adapts an ImageNet-pretrained diffusion model to the target domain using only unlabeled data. Its key contributions are: (1) a FiLM-based mechanism that jointly conditions on diffusion timesteps and modality identifiers to unify heterogeneous input representations; (2) a pseudo-RGB anchoring strategy that bridges the modality gap between natural images and remote sensing data; and (3) low-rank fine-tuning that updates only ~5% of parameters, ensuring efficiency and guarding against catastrophic forgetting. Evaluated on two multimodal remote sensing benchmarks, UniDiff substantially reduces annotation dependency, achieves state-of-the-art segmentation performance even under extremely sparse labeling, and improves the quality of multi-source feature fusion.
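The joint conditioning can be pictured as a FiLM layer whose per-channel scale and shift are predicted from the sum of a timestep embedding and a learned modality embedding. Below is a minimal PyTorch sketch under that assumption; the module name, dimensions, and conditioning head are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn


class TimestepModalityFiLM(nn.Module):
    """FiLM modulation driven jointly by the diffusion timestep and a modality id."""

    def __init__(self, num_modalities: int, embed_dim: int, num_channels: int):
        super().__init__()
        # One learned embedding per modality (e.g., 0=pseudo-RGB, 1=HSI, 2=SAR).
        self.modality_embed = nn.Embedding(num_modalities, embed_dim)
        # Small head mapping the fused condition to per-channel scale/shift.
        self.to_film = nn.Sequential(
            nn.SiLU(),
            nn.Linear(embed_dim, 2 * num_channels),
        )

    def forward(self, h: torch.Tensor, t_emb: torch.Tensor,
                modality_id: torch.Tensor) -> torch.Tensor:
        # h: (B, C, H, W) feature map; t_emb: (B, D) timestep embedding;
        # modality_id: (B,) integer sensor labels.
        cond = t_emb + self.modality_embed(modality_id)    # (B, D)
        gamma, beta = self.to_film(cond).chunk(2, dim=-1)  # (B, C) each
        # Broadcast over spatial dims and modulate the features.
        return gamma[:, :, None, None] * h + beta[:, :, None, None]
```

Because gamma and beta depend on both the denoising step and the sensor type, a single backbone can modulate its features differently for HSI, SAR, and pseudo-RGB inputs without duplicating weights.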

📝 Abstract
Sparse annotations fundamentally constrain multimodal remote sensing: even recent state-of-the-art supervised methods such as MSFMamba are limited by the availability of labeled data, which restricts their practical deployment despite architectural advances. ImageNet-pretrained models provide rich visual representations, but adapting them to heterogeneous modalities such as hyperspectral imaging (HSI) and synthetic aperture radar (SAR) without large labeled datasets remains challenging. We propose UniDiff, a parameter-efficient framework that adapts a single ImageNet-pretrained diffusion model to multiple sensing modalities using only unlabeled target-domain data. UniDiff combines FiLM-based timestep-modality conditioning, parameter-efficient adaptation of approximately 5% of parameters, and pseudo-RGB anchoring to preserve pretrained representations and prevent catastrophic forgetting. This design enables effective feature extraction from remote sensing data under sparse annotations. Results on two established multimodal benchmark datasets demonstrate that unsupervised adaptation of a pretrained diffusion model mitigates annotation constraints and achieves effective fusion of multimodal remotely sensed data.
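The "approximately 5% of parameters" budget is characteristic of LoRA-style low-rank updates attached to a frozen backbone. Below is a minimal sketch of one such wrapper, assuming a generic rank-r residual on a linear layer; the paper's actual adapter placement and rank are not specified here.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank residual W + s*(B @ A)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # keep pretrained weights frozen
            p.requires_grad = False
        # Low-rank factors: A projects down to rank r, B projects back up.
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # residual starts as a zero update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```

Wrapping only selected layers (for instance the attention projections) of a pretrained U-Net in such adapters leaves the overwhelming majority of weights untouched, which is what makes a small trainable budget compatible with avoiding catastrophic forgetting.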
Problem

Research questions and friction points this paper is trying to address.

Extreme label scarcity limits semantic segmentation of multimodal remote sensing imagery
Heterogeneous modalities such as HSI and SAR resist direct transfer from ImageNet-pretrained models
Annotation cost restricts the practical deployment of otherwise strong supervised methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapts a single ImageNet-pretrained diffusion model to multiple sensing modalities
Uses parameter-efficient low-rank adaptation of only ~5% of parameters
Combines FiLM timestep-modality conditioning with pseudo-RGB anchoring (see the sketch after this list)
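Pseudo-RGB anchoring maps each modality into a three-channel surrogate that an ImageNet-pretrained backbone already handles well. One plausible realization is sketched below; the band indices and normalization are purely illustrative, not reproduced from the paper.

```python
import torch


def hsi_to_pseudo_rgb(hsi: torch.Tensor, rgb_bands=(29, 19, 9)) -> torch.Tensor:
    """Select three HSI bands near red/green/blue wavelengths.

    hsi: (B, C, H, W) hyperspectral cube. The band indices depend on the
    sensor's wavelength grid and are illustrative here.
    """
    rgb = hsi[:, list(rgb_bands), :, :]
    # Per-image min-max normalization into [0, 1] to match natural-image statistics.
    flat = rgb.flatten(2)                               # (B, 3, H*W)
    lo = flat.min(dim=2).values[:, :, None, None]
    hi = flat.max(dim=2).values[:, :, None, None]
    return (rgb - lo) / (hi - lo + 1e-8)


def sar_to_pseudo_rgb(sar: torch.Tensor) -> torch.Tensor:
    """Replicate a single-channel (e.g., log-amplitude) SAR image across RGB."""
    if sar.shape[1] == 1:
        sar = sar.repeat(1, 3, 1, 1)
    return sar
```

Anchoring both modalities to a shared three-channel representation gives the frozen diffusion backbone inputs that resemble its pretraining distribution, which complements the FiLM conditioning and low-rank adapters shown above.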