DiffuseDef: Improved Robustness to Adversarial Attacks

📅 2024-06-28
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Pretrained language models (PLMs) exhibit poor robustness against adversarial attacks in text classification. To address this, we propose DiffuseDef, the first plug-and-play framework that adapts diffusion modeling principles from computer vision to NLP adversarial defense. DiffuseDef inserts a learnable latent-space diffusion denoising module between the encoder and classifier, enhancing robustness without modifying the backbone PLM (e.g., BERT or RoBERTa). It operates via controlled noise injection, multi-step iterative denoising, and representation ensembling. Crucially, DiffuseDef integrates adversarial training with latent-state diffusion modeling, enabling a lightweight, backbone-agnostic defense without fine-tuning the main model. Extensive experiments demonstrate that DiffuseDef achieves state-of-the-art robust accuracy under diverse white-box and black-box attacks, significantly outperforming existing defense methods while preserving clean-data performance.

๐Ÿ“ Abstract
Pretrained language models have significantly advanced performance across various natural language processing tasks. However, adversarial attacks continue to pose a critical challenge to systems built using these models, as they can be exploited with carefully crafted adversarial texts. Inspired by the ability of diffusion models to predict and reduce noise in computer vision, we propose a novel and flexible adversarial defense method for language classification tasks, DiffuseDef, which incorporates a diffusion layer as a denoiser between the encoder and the classifier. The diffusion layer is trained on top of the existing classifier, ensuring seamless integration with any model in a plug-and-play manner. During inference, the adversarial hidden state is first combined with sampled noise, then denoised iteratively and finally ensembled to produce a robust text representation. By integrating adversarial training, denoising, and ensembling techniques, we show that DiffuseDef improves over existing adversarial defense methods and achieves state-of-the-art performance against common black-box and white-box adversarial attacks.
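The inference procedure described in the abstract (noise injection, iterative denoising, then ensembling) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: `denoise_fn` is a hypothetical stand-in for the trained diffusion layer, and the noise schedule, step count, and ensemble size are assumed parameters.

```python
import numpy as np

def diffusedef_inference(hidden_state, denoise_fn, num_steps=5,
                         num_samples=4, noise_scale=0.1, rng=None):
    """Sketch of DiffuseDef-style inference on an encoder hidden state.

    `denoise_fn` stands in for the learned diffusion denoising layer,
    which the paper trains on top of a frozen encoder/classifier.
    All hyperparameter values here are illustrative assumptions.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    denoised = []
    for _ in range(num_samples):
        # 1) Combine the (possibly adversarial) hidden state with sampled noise.
        h = hidden_state + noise_scale * rng.standard_normal(hidden_state.shape)
        # 2) Denoise iteratively with the diffusion layer.
        for _ in range(num_steps):
            h = denoise_fn(h)
        denoised.append(h)
    # 3) Ensemble the denoised samples into one robust representation.
    return np.mean(denoised, axis=0)
```

Because the diffusion layer sits between the encoder and the classifier, the ensembled representation returned here would simply be fed to the unchanged classification head.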
Problem

Research questions and friction points this paper is trying to address.

Defending language models from adversarial attacks
Improving robustness via iterative denoising
Integrating diffusion models for text classification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses diffusion layer as denoiser for defense
Integrates adversarial training and ensembling techniques
Plug-and-play compatibility with existing models