AI Summary
This work addresses the vulnerability of vision-language models such as CLIP to imperceptible adversarial perturbations and the high computational cost of existing test-time defenses. The authors propose a training-free, efficient defense mechanism that leverages data augmentation to generate features with robust geometric consistency as anchors. In CLIP's hyperspherical feature space, input features are adaptively corrected along geodesics toward these anchors. This approach is the first to reveal the intrinsic geometric regularities of CLIP features on the hypersphere and employs an adaptive step size to balance robustness and clean accuracy. Experiments across eight fine-grained datasets and three CLIP backbones demonstrate an average improvement of 44.4% in robust accuracy and a tenfold reduction in inference latency, significantly outperforming state-of-the-art methods.
Abstract
Vision-language models like CLIP have demonstrated remarkable zero-shot transfer capabilities. However, their susceptibility to imperceptible adversarial perturbations remains a critical security concern. While test-time defenses offer a pragmatic solution for deployed models, existing approaches typically rely on gradient-based optimization during inference, incurring significant computational overhead. In this paper, we revisit the role of data augmentation in CLIP robustness and observe that augmentations are not equally effective: specific augmentations consistently provide robust geometric cues that align with correct class semantics in the hyperspherical feature space. Based on this, we propose Adaptive Geodesic Correction (AGC), a training-free defense mechanism that requires no parameter updates. AGC identifies a reliable augmentation as a geometric anchor and corrects the input feature towards it, using an adaptive step size to balance robustness against clean accuracy preservation. AGC achieves superior performance across eight fine-grained datasets and three CLIP backbones, improving average robust accuracy by 44.4% over the state-of-the-art baseline while delivering a 10× reduction in inference latency. Our findings reveal a fundamental geometric property of CLIP features, offering a highly efficient and effective paradigm for robust multimodal deployment.
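To make the idea of correcting a feature along a geodesic concrete, here is a minimal NumPy sketch of moving a unit-norm feature toward an anchor along the great-circle arc between them (spherical interpolation). It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the anchor augmentation is chosen or how the step size is computed, so `adaptive_step`, its `max_step` parameter, and the rule of scaling the step with the angle are all hypothetical placeholders.

```python
import numpy as np


def normalize(v):
    """Project a vector onto the unit hypersphere."""
    return v / np.linalg.norm(v)


def geodesic_correct(z, anchor, step):
    """Move z a fraction `step` (0..1) of the way toward `anchor`
    along the great-circle arc joining them (spherical interpolation)."""
    z, anchor = normalize(z), normalize(anchor)
    theta = np.arccos(np.clip(np.dot(z, anchor), -1.0, 1.0))
    sin_theta = np.sin(theta)
    if sin_theta < 1e-8:
        # Identical or antipodal points: the arc is trivial or not unique.
        return z
    out = (np.sin((1.0 - step) * theta) * z
           + np.sin(step * theta) * anchor) / sin_theta
    return normalize(out)


def adaptive_step(z, anchor, max_step=0.5):
    """HYPOTHETICAL step rule (not from the paper): take a larger step
    when the input feature disagrees more with the anchor."""
    z, anchor = normalize(z), normalize(anchor)
    theta = np.arccos(np.clip(np.dot(z, anchor), -1.0, 1.0))
    return max_step * theta / np.pi


# Toy usage with stand-in 3-D vectors in place of real CLIP features.
z = np.array([1.0, 0.0, 0.0])       # input image feature (assumed)
anchor = np.array([0.0, 1.0, 0.0])  # feature of the chosen augmentation (assumed)
corrected = geodesic_correct(z, anchor, adaptive_step(z, anchor))
```

With `step = 0` the feature is unchanged and with `step = 1` it lands on the anchor; intermediate values stay on the unit sphere, which is why the interpolation is done along the arc rather than by averaging the two vectors linearly.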