AI Summary
In randomized smoothing, employing pre-trained diffusion denoising models introduces covariate shift due to biased noise estimation, degrading certified robustness. This shift arises because the standard denoising objective is misaligned with the actual noise distribution induced by the smoothing process. To address this, we propose an adversarial training objective tailored to the diffusion process: adversarial perturbations are injected during the noise addition stage, explicitly adapting the base classifier to the true smoothed noise distribution. Crucially, our method requires no architectural modifications to the denoising model and no retraining of the diffusion model; covariate shift is mitigated at its source solely by reformulating the training objective. Evaluated on MNIST, CIFAR-10, and ImageNet under l2 perturbations, our approach achieves state-of-the-art certified accuracy, significantly outperforming existing randomized smoothing and diffusion-augmented robustness methods.
Abstract
Randomized smoothing is a well-established method for achieving certified robustness against l2-adversarial perturbations. By incorporating a denoiser before the base classifier, pretrained classifiers can be seamlessly integrated into randomized smoothing without significant performance degradation. Among existing methods, Diffusion Denoised Smoothing - where a pretrained denoising diffusion model serves as the denoiser - has produced state-of-the-art results. However, we show that employing a denoising diffusion model introduces a covariate shift via misestimation of the added noise, ultimately degrading the smoothed classifier's performance. To address this issue, we propose a novel adversarial objective function that targets the noise added by the denoising diffusion model, a design motivated by our analysis of the covariate shift's origin. Our goal is to train the base classifier so that it is robust to the covariate shift introduced by the denoiser. Our method significantly improves certified accuracy across three standard classification benchmarks - MNIST, CIFAR-10, and ImageNet - achieving new state-of-the-art performance against l2-adversarial perturbations. Our implementation is publicly available at https://github.com/ahedayat/Robustifying-DDS-Against-Covariate-Shift
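To make the idea concrete, the following is a minimal, hypothetical sketch of an adversarial objective on the *added noise* in a denoised-smoothing pipeline. It is not the paper's implementation: the linear classifier, the closed-form Gaussian MMSE denoiser (standing in for a pretrained diffusion denoiser), and all names and hyperparameters here are illustrative assumptions. The key step it demonstrates is perturbing the smoothing noise (rather than the clean input) to maximize the classifier's loss on the denoised sample.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5                      # smoothing noise level (illustrative)
num_classes, dim = 3, 8
W = rng.normal(size=(num_classes, dim))   # toy linear base classifier (assumption)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def denoise(z):
    # Closed-form MMSE denoiser E[x | x + sigma*eps = z] under an
    # x ~ N(0, I) prior -- a stand-in for a pretrained diffusion denoiser.
    return z / (1.0 + sigma**2)

def ce_loss_and_grad(x, y):
    # Cross-entropy of the linear classifier and its gradient w.r.t. x.
    p = softmax(W @ x)
    onehot = np.eye(num_classes)[y]
    return -np.log(p[y] + 1e-12), W.T @ (p - onehot)

def adversarial_noise(x, y, steps=5, alpha=0.1):
    # Perturb the added Gaussian noise eps0 so that the *denoised* sample
    # maximizes the classifier loss; training the base classifier on such
    # samples adapts it to the denoiser-induced covariate shift.
    eps0 = rng.normal(size=dim)
    delta = np.zeros(dim)
    for _ in range(steps):
        xhat = denoise(x + sigma * (eps0 + delta))
        _, g_x = ce_loss_and_grad(xhat, y)
        g_delta = (sigma / (1.0 + sigma**2)) * g_x   # chain rule through denoiser
        norm = np.linalg.norm(g_delta)
        if norm < 1e-12:
            break
        delta += alpha * g_delta / norm              # normalized l2 ascent step
    return eps0, eps0 + delta

x = rng.normal(size=dim)
y = 0
eps0, eps_adv = adversarial_noise(x, y)
loss_clean, _ = ce_loss_and_grad(denoise(x + sigma * eps0), y)
loss_adv, _ = ce_loss_and_grad(denoise(x + sigma * eps_adv), y)
```

Because the loss here is convex in the noise perturbation, each ascent step can only raise it, so `loss_adv >= loss_clean`; in the actual method the ascent would run through a diffusion denoiser and a deep classifier via automatic differentiation.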