High-Rate Mixout: Revisiting Mixout for Robust Domain Generalization

📅 2025-10-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Fine-tuned models under distribution shift suffer from poor robustness, high ensemble-training cost, and degradation of pretrained knowledge (e.g., when Dropout is applied). This paper proposes High-Rate Mixout: a fine-tuning strategy that stochastically replaces up to 90% of learnable parameters with their corresponding pretrained weights, enforcing strong regularization toward the pretrained model while still allowing adaptive parameter updates. Applied to ViT and ResNet architectures, it balances retention of pretrained knowledge against adaptation to the source domains. Evaluated on five standard out-of-distribution benchmarks (PACS, VLCS, etc.), the method achieves single-model performance competitive with ensemble approaches while cutting gradient computation by up to 45% and gradient memory usage by up to 90%, significantly improving training efficiency. Its core contribution is the first systematic investigation of high-masking-rate Mixout for robust fine-tuning, revealing the mechanism by which it enhances generalization. The result is a lightweight, efficient, plug-and-play paradigm for out-of-distribution generalization.

📝 Abstract
Ensembling fine-tuned models initialized from powerful pre-trained weights is a common strategy to improve robustness under distribution shifts, but it comes with substantial computational costs due to the need to train and store multiple models. Dropout offers a lightweight alternative by simulating ensembles through random neuron deactivation; however, when applied to pre-trained models, it tends to over-regularize and disrupt critical representations necessary for generalization. In this work, we investigate Mixout, a stochastic regularization technique that provides an alternative to Dropout for domain generalization. Rather than deactivating neurons, Mixout mitigates overfitting by probabilistically swapping a subset of fine-tuned weights with their pre-trained counterparts during training, thereby maintaining a balance between adaptation and retention of prior knowledge. Our study reveals that achieving strong performance with Mixout on domain generalization benchmarks requires a notably high masking probability of 0.9 for ViTs and 0.8 for ResNets. While this may seem like a simple adjustment, it yields two key advantages for domain generalization: (1) higher masking rates more strongly penalize deviations from the pre-trained parameters, promoting better generalization to unseen domains; and (2) high-rate masking substantially reduces computational overhead, cutting gradient computation by up to 45% and gradient memory usage by up to 90%. Experiments across five domain generalization benchmarks, PACS, VLCS, OfficeHome, TerraIncognita, and DomainNet, using ResNet and ViT architectures, show that our approach, High-rate Mixout, achieves out-of-domain accuracy comparable to ensemble-based methods while significantly reducing training costs.
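The Mixout update described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the weight shapes, the seed, and the helper name `mixout` are illustrative. Each weight is swapped with its pre-trained counterpart with probability `p`, and the result is rescaled (as in Lee et al.'s original Mixout) so that its expectation equals the fine-tuned weight.

```python
import numpy as np

def mixout(w_finetuned, w_pretrained, p, rng):
    """Mixout: swap each fine-tuned weight with its pre-trained
    counterpart with probability p, then rescale so the expected
    output equals the fine-tuned weight."""
    mask = rng.random(w_finetuned.shape) < p  # True -> use pre-trained weight
    mixed = np.where(mask, w_pretrained, w_finetuned)
    # Bias correction: E[(mixed - p * w_pre) / (1 - p)] = w_finetuned
    return (mixed - p * w_pretrained) / (1.0 - p)

rng = np.random.default_rng(0)
w_ft = rng.normal(size=(512, 512))   # illustrative fine-tuned layer
w_pre = rng.normal(size=(512, 512))  # its pre-trained counterpart

# High-rate setting from the paper: masking probability 0.9 for ViTs
out = mixout(w_ft, w_pre, p=0.9, rng=rng)
```

At p = 0.9, roughly 90% of entries are pinned to the pre-trained weights on each training step, which is the strong pull toward the pre-trained parameters that the abstract credits for better out-of-domain generalization.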
Problem

Research questions and friction points this paper is trying to address.

Improving domain generalization without expensive ensemble training costs
Preventing over-regularization when applying Dropout to pre-trained models
Balancing adaptation and retention of prior knowledge during fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixout probabilistically swaps fine-tuned weights with their pre-trained counterparts during training
High masking probability (0.9 for ViTs, 0.8 for ResNets) improves generalization to unseen domains
Cuts gradient computation by up to 45% and gradient memory usage by up to 90%
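A short sketch of why high masking rates also cut gradient cost (illustrative shapes and seed, not the paper's code): the Mixout output depends on the fine-tuned weight only at unmasked entries, so the gradient with respect to the fine-tuned parameters is zero for every masked entry, i.e. for roughly a fraction `p` of the weights.

```python
import numpy as np

# Per-entry derivative of the Mixout output
#   (mask * w_pre + (1 - mask) * w_ft - p * w_pre) / (1 - p)
# with respect to w_ft: 1/(1-p) where the fine-tuned weight is kept,
# and 0 wherever the mask substituted the pre-trained weight.
rng = np.random.default_rng(42)
p = 0.9
mask = rng.random((1000, 1000)) < p                 # True -> pre-trained used
grad_wrt_wft = np.where(mask, 0.0, 1.0 / (1 - p))   # d(out)/d(w_ft) per entry

sparsity = np.mean(grad_wrt_wft == 0.0)
print(f"zero-gradient fraction: {sparsity:.2f}")  # ~0.90 with p = 0.9
```

With p = 0.9, about 90% of gradient entries vanish on every step, which is consistent with the reported reduction in gradient memory; the 45% cut in gradient computation is the paper's measured figure, not something this sketch derives.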