🤖 AI Summary
This work addresses the performance degradation of models under test-time distribution shifts by proposing an efficient adaptation method based on singular value decomposition (SVD). The approach decouples linear layers in Vision Transformers into fixed singular vectors and learnable singular values, thereby constructing an intrinsic spectral mixture-of-experts architecture. To mitigate feature collapse, a diversity-maximization loss is introduced, and a domain-aware spectral code retrieval mechanism is designed to leverage knowledge from previously seen domains. Fine-tuning only 0.26% of the model parameters, the method achieves state-of-the-art performance across multiple distribution-shift benchmarks, improving accuracy by 3.4 and 2.4 percentage points in the continual (CTTA) and gradual (Gradual CTTA) test-time adaptation scenarios, respectively.
📝 Abstract
Test-time adaptation (TTA) has been widely explored to prevent performance degradation when test data differ from the training distribution. However, fully leveraging the rich representations of large pretrained models with minimal parameter updates remains underexplored. In this paper, we propose the Intrinsic Mixture of Spectral Experts (IMSE), which leverages the spectral experts inherently embedded in Vision Transformers. We decompose each linear layer via singular value decomposition (SVD) and adapt only the singular values, while keeping the singular vectors fixed. We further identify a key limitation of entropy minimization in TTA: it often induces feature collapse, causing the model to rely on domain-specific features rather than class-discriminative features. To address this, we propose a diversity maximization loss based on expert-input alignment, which encourages diverse utilization of spectral experts during adaptation. In the continual test-time adaptation (CTTA) scenario, beyond preserving pretrained knowledge, it is crucial to retain and reuse knowledge from previously observed domains. We therefore introduce Domain-Aware Spectral Code Retrieval, which estimates input distributions to detect domain shifts and retrieves previously adapted singular values for rapid adaptation. Consequently, our method achieves state-of-the-art performance on various distribution-shift benchmarks under the TTA setting. In CTTA and Gradual CTTA, it further improves accuracy by 3.4 percentage points (pp) and 2.4 pp, respectively, while requiring 385 times fewer trainable parameters. Our code is available at https://github.com/baek85/IMSE.
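The core decoupling idea can be illustrated with a minimal NumPy sketch (not the authors' implementation; the `forward` helper and shapes are illustrative assumptions): a linear layer's weight is factored by SVD, the singular vectors are frozen, and only the singular values remain as adaptation parameters, so each rank-1 component acts as one "spectral expert".

```python
import numpy as np

# Hypothetical sketch of SVD-decoupled adaptation, assuming a single
# 8x8 linear layer; the paper applies this to ViT linear layers.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))

# W = U @ diag(s) @ Vt; U and Vt are frozen, s is the only trainable part.
U, s, Vt = np.linalg.svd(W, full_matrices=False)

def forward(x, s_adapted):
    # Each column U[:, i] * s_adapted[i] paired with row Vt[i, :] is
    # one rank-1 "spectral expert"; adaptation rescales the experts.
    return (U * s_adapted) @ Vt @ x

x = rng.standard_normal(8)
# With the original singular values, the decomposition reproduces W exactly.
assert np.allclose(forward(x, s), W @ x)

# Test-time adaptation would update only s: len(s) parameters instead of
# the full W.size, which is where the tiny trainable-parameter count comes from.
print(len(s), W.size)
```

For an 8x8 layer this is 8 trainable values versus 64 full weights; at ViT scale the same ratio yields the sub-1% trainable-parameter budget the paper reports.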