🤖 AI Summary
To address the lack of a universal objective in test-time adaptation (TTA), this paper proposes a unified TTA framework applicable to classification and regression tasks at the image, object, and pixel levels. Methodologically, it analyzes how common distribution shifts affect an image's information power across spatial frequencies, and introduces an augmentation that randomly masks the low-frequency amplitude in the Fourier domain and injects noise to compensate for the missing learning signal at high frequencies. This yields deteriorated views that still preserve geometric structure (e.g., object sizes and locations). The model is then adapted without labels through a self-bootstrapping consistency objective between the test image, used as the target, and its deteriorated view. The approach supports both CNNs and Transformers. Experiments report superior results on image classification, semantic segmentation, and monocular 3D object detection, with the method used either independently or as a plug-and-play module.
📝 Abstract
In this paper, we seek to develop a versatile test-time adaptation (TTA) objective for a variety of tasks: classification and regression across image-, object-, and pixel-level predictions. We achieve this through a self-bootstrapping scheme that optimizes prediction consistency between the test image (as target) and its deteriorated view. The key challenge lies in devising effective augmentations/deteriorations that: i) preserve the image's geometric information, e.g., object sizes and locations, which is crucial for TTA on object/pixel-level tasks, and ii) provide sufficient learning signals for TTA. To this end, we analyze how common distribution shifts affect the image's information power across spatial frequencies in the Fourier domain, and reveal that low-frequency components carry high power and that masking these components supplies more learning signals, whereas masking high-frequency components does not. In light of this, we randomly mask the low-frequency amplitude of an image in its Fourier domain for augmentation. Meanwhile, we also augment the image with noise injection to compensate for missing learning signals at high frequencies, by enhancing the information power there. Experiments show that, either independently or as a plug-and-play module, our method achieves superior results across classification, segmentation, and monocular 3D detection tasks with both transformer and CNN models.
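The deterioration described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `deteriorate` and `consistency_loss`, the square low-frequency window, the parameters `radius`, `mask_ratio`, and `noise_std`, the use of Gaussian noise, and the cross-entropy form of the consistency term are all assumptions made for illustration. The abstract states only that low-frequency amplitudes are randomly masked, that noise is injected, and that prediction consistency is optimized with the test image as target.

```python
import numpy as np


def deteriorate(image, radius=8, mask_ratio=0.5, noise_std=0.05, rng=None):
    """Randomly mask low-frequency amplitudes, then add noise (illustrative).

    image: float array of shape (H, W) or (H, W, C).
    radius, mask_ratio, noise_std: hypothetical knobs, not from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    img = np.asarray(image, dtype=np.float64)
    squeeze = img.ndim == 2
    if squeeze:
        img = img[..., None]
    h, w, _ = img.shape

    # 2D FFT per channel, shifted so the zero frequency sits at (h//2, w//2).
    spec = np.fft.fftshift(np.fft.fft2(img, axes=(0, 1)), axes=(0, 1))
    amp, phase = np.abs(spec), np.angle(spec)

    # Assumed low-frequency region: a square window around the centre.
    yy, xx = np.ogrid[:h, :w]
    low = (np.abs(yy - h // 2) <= radius) & (np.abs(xx - w // 2) <= radius)

    # Zero a random subset of amplitudes inside that window; phase is kept,
    # so no spatial warping or cropping is applied to the image.
    drop = low & (rng.random((h, w)) < mask_ratio)
    amp = amp * (~drop)[..., None]

    # The random mask is not conjugate-symmetric, so the inverse transform
    # is complex in general; this sketch keeps only the real part.
    masked = amp * np.exp(1j * phase)
    out = np.fft.ifft2(np.fft.ifftshift(masked, axes=(0, 1)), axes=(0, 1)).real

    # Assumed Gaussian noise, standing in for the abstract's noise injection.
    out = out + rng.normal(0.0, noise_std, size=out.shape)
    return out[..., 0] if squeeze else out


def _log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def consistency_loss(target_logits, view_logits):
    """Cross-entropy from the target prediction to the deteriorated-view
    prediction (one assumed choice of consistency term for classification)."""
    p_target = np.exp(_log_softmax(np.asarray(target_logits, dtype=np.float64)))
    log_q = _log_softmax(np.asarray(view_logits, dtype=np.float64))
    return float(-(p_target * log_q).sum(axis=-1).mean())
```

In an actual TTA loop, the model would be run on both the test image and `deteriorate(image)`, the prediction on the original image would be treated as a fixed target, and the loss would be backpropagated through the deteriorated branch in an autodiff framework. This sketch does not treat the zero-frequency (DC) term specially, so masking it shifts the mean brightness; whether the paper does so is not stated in the abstract.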