🤖 AI Summary
Vision Transformers (ViTs) achieve strong performance in vision-language models but suffer from poor adversarial robustness; moreover, existing adversarial training methods (e.g., Generalist, DBAT) are incompatible with ViT architectures. Method: This work establishes, for the first time, a theoretical mutual information (MI) bound linking adversarial examples and latent representations in ViT-based autoencoders. We propose a self-supervised adversarial training framework constrained by MI minimization, jointly optimizing masked image modeling (MIM) and autoencoding while incorporating an MI penalty to guide robust pretraining. Contribution/Results: Our method significantly improves robustness against unseen, adaptive, and out-of-distribution attacks. It achieves state-of-the-art performance on CIFAR-10, Tiny-ImageNet, and ImageNet-1K, setting a new best robust accuracy on ImageNet-1K, and demonstrates strong generalization to common image corruptions.
📄 Abstract
Vision Transformers (ViTs) have emerged as a fundamental architecture and serve as the backbone of modern vision-language models. Despite their impressive performance, ViTs exhibit notable vulnerability to evasion attacks, necessitating the development of specialized Adversarial Training (AT) strategies tailored to their unique architecture. While a direct solution might involve applying existing AT methods to ViTs, our analysis reveals significant incompatibilities, particularly with state-of-the-art (SOTA) approaches such as Generalist (CVPR 2023) and DBAT (USENIX Security 2024). This paper presents a systematic investigation of adversarial robustness in ViTs and provides a novel theoretical Mutual Information (MI) analysis of their autoencoder-based self-supervised pre-training. Specifically, we show that the MI between an adversarial example and its latent representation in ViT-based autoencoders should be constrained via derived MI bounds. Building on this insight, we propose a self-supervised AT method, MIMIR, that employs an MI penalty to facilitate adversarial pre-training by masked image modeling with autoencoders. Extensive experiments on CIFAR-10, Tiny-ImageNet, and ImageNet-1K show that MIMIR consistently improves both natural and robust accuracy, and outperforms SOTA AT results on ImageNet-1K. Notably, MIMIR demonstrates superior robustness against unforeseen attacks and common corruptions, and can also withstand adaptive attacks where the adversary possesses full knowledge of the defense mechanism.
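To make the training objective concrete, the sketch below illustrates the general shape of MI-penalized adversarial pre-training described above: craft an adversarial input, reconstruct the clean target through an autoencoder, and add a penalty on the latent representation. Everything here is a hypothetical toy stand-in, not the paper's implementation: the tiny linear autoencoder replaces the ViT-based masked autoencoder, the PGD routine is a generic attack on reconstruction loss, and the latent L2-norm term is only a crude proxy for the paper's derived MI bound.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyAutoencoder(nn.Module):
    """Hypothetical toy autoencoder standing in for the ViT-based
    masked autoencoder; masking is omitted for brevity."""
    def __init__(self, dim=32, latent=8):
        super().__init__()
        self.enc = nn.Linear(dim, latent)
        self.dec = nn.Linear(latent, dim)

    def forward(self, x):
        z = self.enc(x)            # latent representation
        return self.dec(z), z      # reconstruction, latent

def pgd_perturb(model, x, eps=0.1, alpha=0.02, steps=3):
    """Generic PGD-style attack that maximizes reconstruction error
    within an L-infinity ball of radius eps."""
    delta = torch.zeros_like(x)
    for _ in range(steps):
        delta.requires_grad_(True)
        recon, _ = model(x + delta)
        loss = F.mse_loss(recon, x)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach()
    return x + delta

def mi_penalized_loss(model, x, lam=0.1):
    """Reconstruction loss on the adversarial input plus an MI penalty.
    The latent squared-norm term is only an illustrative proxy for
    constraining I(x_adv; z); the paper derives a proper MI bound."""
    x_adv = pgd_perturb(model, x)
    recon, z = model(x_adv)
    recon_loss = F.mse_loss(recon, x)   # reconstruct the clean target
    mi_penalty = z.pow(2).mean()        # crude stand-in for the MI bound
    return recon_loss + lam * mi_penalty
```

Minimizing this combined loss pushes the encoder to keep only information useful for reconstructing the clean image while discarding perturbation-dependent content, which is the intuition behind constraining the MI between adversarial inputs and their latents.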