Selective, Regularized, and Calibrated: Harnessing Vision Foundation Models for Cross-Domain Few-Shot Semantic Segmentation

📅 2026-05-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses overfitting and feature inconsistency in cross-domain few-shot semantic segmentation, both of which stem from scarce annotations and domain shift. The authors propose HERA (Hierarchical Exemplar Representation Adaptation), a lightweight test-time adaptation framework built on a vision foundation model whose features stay frozen. HERA combines three components: Hierarchical Layer Selection (HLS), a data-dependent mechanism for choosing which layer to use; Prior-Guided Regularization (PGR); and Pixelwise Adaptive Calibration (PAC). It adapts to new domains while fine-tuning less than 2.7% of the model parameters. On multiple cross-domain few-shot segmentation benchmarks, HERA surpasses the state of the art by more than 4.1 mIoU.
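For readers unfamiliar with the metric, the following is a minimal sketch of how a mean intersection-over-union (mIoU) score can be computed from binary masks. The function names and the toy masks are illustrative assumptions, and the way scores are aggregated across classes and episodes follows each benchmark's own protocol, which this summary does not specify.

```python
def iou(pred, target):
    """Intersection over union of two binary masks given as flat 0/1 sequences."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Assumption: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(pairs):
    """Average IoU over (prediction, ground truth) pairs, in percentage points."""
    return 100.0 * sum(iou(p, t) for p, t in pairs) / len(pairs)


# Toy masks (hypothetical data, not from the paper).
pairs = [
    ([1, 1, 0, 0], [1, 0, 0, 0]),  # IoU = 1/2
    ([0, 1, 1, 0], [0, 1, 1, 0]),  # IoU = 1
]
score = mean_iou(pairs)  # (0.5 + 1.0) / 2 * 100 = 75.0
```

On this scale, a gain of 4.1 mIoU means the averaged score rises by 4.1 percentage points.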

📝 Abstract

Vision foundation models (VFMs) have achieved strong performance across various vision tasks. However, it remains challenging to apply VFMs to cross-domain few-shot segmentation (CD-FSS), which requires segmenting objects of novel classes under domain shifts using only a few labeled exemplars. The challenge is driven mainly by two factors: (1) labeled exemplars per novel class are scarce relative to the scale of VFM pre-training, making the model prone to overfitting during retraining, and (2) target-domain shifts are underrepresented during pre-training, inducing cross-domain inconsistency and layer-wise sensitivity. To address these issues, we propose Hierarchical Exemplar Representation Adaptation (HERA), a three-stage select-regularize-calibrate VFM-based segmentation framework that learns effectively from limited labels and adapts to novel domains without source-data retraining. We first design Hierarchical Layer Selection (HLS) to adaptively identify the most informative VFM layer using a data-dependent Exemplar Transfer Risk (ETR) computed for each candidate layer. Then, Prior-Guided Regularization (PGR) regularizes interactions on the selected representation, yielding well-structured local signals for the subsequent stage. Finally, Pixelwise Adaptive Calibration (PAC) combines the selected representation with the refined interaction maps to calibrate pixel-wise predictions, producing consistent masks. Together, these stages form a hierarchical select-regularize-calibrate pipeline that guides frozen VFM features in new domains while fine-tuning less than 2.7% of parameters at test time. Extensive experiments show that HERA surpasses the state of the art by more than 4.1 mIoU across multiple CD-FSS benchmarks.
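The "select" stage can be pictured as scoring every candidate layer on the labeled exemplars and keeping the layer with the lowest risk. The sketch below illustrates only that selection pattern. The abstract does not define ETR, so `proxy_risk` is a hypothetical stand-in (within-class spread divided by the distance between foreground and background prototypes), and the layer names and feature values are invented toy data.

```python
from math import dist


def mean_vec(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def proxy_risk(features, labels):
    """Stand-in risk, NOT the paper's ETR. Lower means the layer separates
    foreground (label 1) from background (label 0) more cleanly.
    Assumes both classes appear at least once."""
    fg = [f for f, y in zip(features, labels) if y == 1]
    bg = [f for f, y in zip(features, labels) if y == 0]
    fg_proto, bg_proto = mean_vec(fg), mean_vec(bg)
    spread = (sum(dist(f, fg_proto) for f in fg)
              + sum(dist(f, bg_proto) for f in bg)) / len(features)
    gap = dist(fg_proto, bg_proto)
    return spread / (gap + 1e-8)  # small constant avoids division by zero


def select_layer(layer_features, labels):
    """Score every candidate layer and return the lowest-risk one."""
    risks = {name: proxy_risk(feats, labels)
             for name, feats in layer_features.items()}
    return min(risks, key=risks.get), risks


# Hypothetical per-pixel exemplar features from two candidate layers.
labels = [1, 1, 0, 0]
layer_features = {
    "layer_6": [[0.9, 1.1], [1.1, 0.9], [1.0, 1.2], [1.2, 1.0]],       # classes overlap
    "layer_12": [[2.0, 2.1], [2.1, 2.0], [-2.0, -2.1], [-2.1, -2.0]],  # classes separate
}
best, risks = select_layer(layer_features, labels)  # best is "layer_12" here
```

Because the score is computed from the exemplars themselves, the chosen layer can differ from one target domain to the next, which is the data-dependent behavior the abstract attributes to HLS.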

Problem

Research questions and friction points this paper is trying to address.

Cross-Domain Few-Shot Semantic Segmentation

Vision Foundation Models

Domain Shift

Few-Shot Learning

Semantic Segmentation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision Foundation Models

Cross-Domain Few-Shot Segmentation

Hierarchical Exemplar Representation Adaptation