Causal-Tune: Mining Causal Factors from Vision Foundation Models for Domain Generalized Semantic Segmentation

📅 2025-12-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing lightweight fine-tuning methods for vision foundation models (VFMs) in domain-generalized semantic segmentation (DGSS) overlook artifacts that emerge in VFMs after long pre-training, which hinder the use of valuable representations and degrade cross-domain robustness. This work associates those artifacts with non-causal factors that usually reside in the low- and high-frequency components of the VFM feature spectrum, and proposes Causal-Tune, a fine-tuning strategy that disentangles causal from non-causal factors in the frequency domain. Specifically, the features of each layer are transformed into the frequency domain via the discrete cosine transform (DCT); a Gaussian band-pass filter separates the spectrum into causal and non-causal components; causal-aware learnable tokens refine the causal components, and the non-causal components are discarded. The refined features are then mapped back to the spatial domain via inverse DCT and passed to the next layer. Evaluated on various cross-domain tasks, the method is reported to be especially strong under adverse weather, e.g., +4.8% mIoU over the baseline in snow conditions.

📝 Abstract
Fine-tuning Vision Foundation Models (VFMs) with a small number of parameters has shown remarkable performance in Domain Generalized Semantic Segmentation (DGSS). Most existing works either train lightweight adapters or refine intermediate features to achieve better generalization on unseen domains. However, both approaches overlook the fact that long-term pre-trained VFMs often exhibit artifacts, which hinder the utilization of valuable representations and ultimately degrade DGSS performance. Inspired by causal mechanisms, we observe that these artifacts are associated with non-causal factors, which usually reside in the low- and high-frequency components of the VFM spectrum. In this paper, we explicitly examine the causal and non-causal factors of features within VFMs for DGSS, and propose a simple yet effective method to identify and disentangle them, enabling more robust domain generalization. Specifically, we propose Causal-Tune, a novel fine-tuning strategy designed to extract causal factors and suppress non-causal ones from the features of VFMs. First, we extract the frequency spectrum of features from each layer using the Discrete Cosine Transform (DCT). A Gaussian band-pass filter is then applied to separate the spectrum into causal and non-causal components. To further refine the causal components, we introduce a set of causal-aware learnable tokens that operate in the frequency domain, while the non-causal components are discarded. Finally, refined features are transformed back into the spatial domain via inverse DCT and passed to the next layer. Extensive experiments conducted on various cross-domain tasks demonstrate the effectiveness of Causal-Tune. In particular, our method achieves superior performance under adverse weather conditions, improving mIoU by +4.8% over the baseline in snow conditions.
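The frequency-domain steps described in the abstract (DCT, Gaussian band-pass filtering, inverse DCT) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the radial definition of frequency, and the `center` and `sigma` values are all assumptions made for illustration, the code handles a single 2D feature map rather than a full multi-channel feature tensor, and the causal-aware learnable tokens are omitted because the abstract does not specify how they operate.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II basis; row k is the k-th cosine basis vector."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def dct2(x):
    """2D DCT-II of an H x W map, computed as C_h @ x @ C_w^T."""
    ch, cw = dct_matrix(len(x)), dct_matrix(len(x[0]))
    return matmul(matmul(ch, x), transpose(cw))

def idct2(s):
    """Inverse 2D DCT, C_h^T @ s @ C_w (valid because the basis is orthonormal)."""
    ch, cw = dct_matrix(len(s)), dct_matrix(len(s[0]))
    return matmul(matmul(transpose(ch), s), cw)

def gaussian_bandpass(h, w, center=0.5, sigma=0.2):
    """Gaussian mask over normalized radial frequency in [0, 1].

    Assumption: frequency of coefficient (u, v) is hypot(u, v) divided by
    the largest radius; `center` and `sigma` are illustrative values only.
    """
    max_r = math.hypot(h - 1, w - 1) or 1.0
    return [[math.exp(-((math.hypot(u, v) / max_r - center) ** 2)
                      / (2 * sigma ** 2))
             for v in range(w)]
            for u in range(h)]

def band_pass_features(x, center=0.5, sigma=0.2):
    """DCT -> attenuate low and high frequencies -> inverse DCT."""
    spectrum = dct2(x)
    mask = gaussian_bandpass(len(x), len(x[0]), center, sigma)
    kept = [[m * s for m, s in zip(mask_row, spec_row)]
            for mask_row, spec_row in zip(mask, spectrum)]
    return idct2(kept)

if __name__ == "__main__":
    # A constant map is pure DC (lowest frequency), so it is strongly damped.
    flat = [[1.0] * 4 for _ in range(4)]
    print(band_pass_features(flat)[0][0])
```

With these default values, a constant feature map is scaled by exp(-3.125), roughly 0.044, because all of its energy sits in the lowest-frequency coefficient, while coefficients near the middle of the band pass almost unchanged. A soft mask like this attenuates rather than strictly discards the out-of-band components; a hard threshold on the mask would be one way to approximate outright removal.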
Problem

Research questions and friction points this paper is trying to address.

Identifies causal factors in vision models for domain generalization
Separates causal and non-causal features using frequency analysis
Improves semantic segmentation robustness in unseen domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses DCT and Gaussian filter to separate causal factors
Introduces causal-aware tokens in frequency domain
Suppresses non-causal artifacts for robust generalization
Yin Zhang
School of Instrument Science and Engineering, Harbin Institute of Technology, Harbin, China
Yongqiang Zhang
Distinguished Professor, Institute of Geographic Sciences and Natural Resources Research, CAS
evapotranspiration, hydrology, remote sensing, climate change, water resources
Yaoyue Zheng
Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Shaanxi, China
Bogdan Raducanu
Computer Vision Center, Universitat Autònoma de Barcelona, Barcelona, Spain
Dan Liu
School of Instrument Science and Engineering, Harbin Institute of Technology, Harbin, China