🤖 AI Summary
To address the challenge of adapting vision foundation models to cross-domain semantic segmentation in unsupervised, data-constrained settings, this paper proposes GLARE, a method that leverages continual self-supervised pretraining for efficient domain adaptation. Its core innovations are: (i) patch-level augmentations for local consistency, coupled with region-level consistency constraints guided by spatial semantics, and (ii) a lightweight UniAdapter module that enables parameter-efficient fine-tuning of ViT backbones. GLARE requires only a small number of unlabeled target-domain images and no downstream annotations, yet achieves substantial performance gains. Evaluated on multiple cross-domain benchmarks (e.g., GTA→Cityscapes), it consistently outperforms existing unsupervised adaptation methods while adding less than 1% extra parameters and minimal computational overhead. The results demonstrate GLARE's effectiveness, generalizability, and practicality for real-world deployment under resource-limited conditions.
📝 Abstract
Self-supervised learning (SSL) has emerged as a central paradigm for training foundation models by leveraging large-scale unlabeled datasets, often producing representations with strong generalization capabilities. These models are typically pre-trained on general-purpose datasets such as ImageNet and subsequently adapted to various downstream tasks through fine-tuning. While recent advances have explored parameter-efficient strategies for adapting pre-trained models, extending SSL pre-training itself to new domains, particularly under limited data regimes and for dense prediction tasks, remains underexplored. In this work, we address the problem of adapting vision foundation models to new domains in an unsupervised and data-efficient manner, specifically targeting downstream semantic segmentation. We propose GLARE (Global Local and Regional Enforcement), a novel continual self-supervised pre-training task designed to enhance downstream segmentation performance. GLARE introduces patch-level augmentations to encourage local consistency and incorporates a regional consistency constraint that leverages the spatial semantics of the data. For efficient continual pre-training, we initialize Vision Transformers (ViTs) with weights from existing SSL models and update only lightweight adapter modules (specifically UniAdapter) while keeping the rest of the backbone frozen. Experiments across multiple semantic segmentation benchmarks on different domains demonstrate that GLARE consistently improves downstream performance with minimal computational and parameter overhead.
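To make the "<1% additional parameters" claim concrete, the back-of-the-envelope sketch below counts parameters for a generic bottleneck adapter inserted into each block of a frozen ViT-B-sized backbone. All names and dimensions here are illustrative assumptions (the exact UniAdapter design is described in the paper, not reproduced here); the point is only that a small down/up projection per block is tiny relative to the attention and MLP weights it sits beside.

```python
# Illustrative parameter-count sketch (not the paper's exact UniAdapter):
# a bottleneck adapter (down-projection + up-projection) per transformer
# block, with the backbone itself kept frozen.

def vit_block_params(dim: int, mlp_ratio: int = 4) -> int:
    """Rough parameter count of one ViT block (biases ignored):
    attention qkv (3*dim*dim) + output projection (dim*dim) + 2-layer MLP."""
    attn = 4 * dim * dim
    mlp = 2 * dim * (mlp_ratio * dim)
    return attn + mlp

def adapter_params(dim: int, bottleneck: int = 8) -> int:
    """Bottleneck adapter: dim -> bottleneck -> dim linear maps."""
    return 2 * dim * bottleneck

def trainable_fraction(dim: int = 768, depth: int = 12, bottleneck: int = 8) -> float:
    """Fraction of extra trainable parameters relative to the frozen backbone
    (ViT-B-like: dim=768, depth=12)."""
    backbone = depth * vit_block_params(dim)
    adapters = depth * adapter_params(dim, bottleneck)
    return adapters / backbone

print(f"extra trainable parameters: {trainable_fraction():.2%}")
```

With these assumed dimensions the adapters add on the order of 0.2% of the backbone's parameters, comfortably under the 1% overhead reported in the abstract; in practice one would also freeze the backbone weights (e.g., via `requires_grad = False` in PyTorch) so that only the adapters receive gradient updates during continual pre-training.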