🤖 AI Summary
To address the vulnerability of NLP models trained on shared datasets to data poisoning and backdoor attacks, this paper proposes a module-level purification method that requires no clean data, auxiliary models, or fine-tuning. The core innovation is to formulate model purification as an interpretable module selection and replacement problem via the Greedy Module Substitution (GMS) mechanism, which identifies and replaces redundant layers or attention heads along backdoor pathways in RoBERTa-large using gradient sensitivity analysis and module impact assessment. The method is plug-and-play with zero training overhead. On SST-2, it achieves a post-purification attack success rate (ASR) of 9.7% against the LWS attack, versus 58.8% for the best existing baseline, significantly outperforming prior sample-cleaning and model-purification approaches. To the authors' knowledge, this is the first approach to achieve efficient, effective backdoor removal using only the target model's intrinsic structural information.
📝 Abstract
The success of DNNs often depends on training with large-scale datasets, but building such datasets is both expensive and challenging. Consequently, public datasets from open-source platforms like HuggingFace have become popular, posing significant risks of data poisoning attacks. Existing backdoor defenses in NLP primarily focus on identifying and removing poisoned samples; however, purifying a backdoored model with these sample-cleaning approaches typically requires expensive retraining. Therefore, we propose Greedy Module Substitution (GMS), which identifies and substitutes "deadwood" modules (i.e., components critical to backdoor pathways) in a backdoored model to purify it. Our method relaxes the common dependency of prior model purification methods on clean datasets or clean auxiliary models. When applied to RoBERTa-large under backdoor attacks, GMS demonstrates strong effectiveness across various settings, particularly against widely recognized challenging attacks like LWS, achieving a post-purification attack success rate (ASR) of 9.7% on SST-2 compared to 58.8% for the best baseline approach.
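To make the greedy substitution idea concrete, below is a minimal, hedged sketch of the kind of loop the abstract describes: starting from the backdoored model, greedily swap in counterpart modules from a donor model whenever the swap lowers a backdoor-strength score. All names here (`greedy_module_substitution`, `score_fn`, the donor model, the toy module representation) are illustrative assumptions, not the paper's actual API; the real method operates on RoBERTa-large layers and attention heads.

```python
# Hedged sketch of a greedy module-substitution loop (not the paper's code).
# Models are represented as dicts mapping module names to module "weights";
# score_fn is an assumed proxy for backdoor strength (lower is better).

def greedy_module_substitution(victim, donor, score_fn, budget):
    """Greedily replace victim modules with donor counterparts,
    keeping each swap only if it lowers the backdoor score."""
    purified = dict(victim)  # work on a copy; leave the victim intact
    for _ in range(budget):
        best_name, best_score = None, score_fn(purified)
        for name in purified:
            if purified[name] == donor[name]:
                continue  # already substituted (or identical) -- skip
            trial = dict(purified)
            trial[name] = donor[name]  # try swapping this one module
            s = score_fn(trial)
            if s < best_score:
                best_name, best_score = name, s
        if best_name is None:
            break  # no single swap improves the score; stop early
        purified[best_name] = donor[best_name]  # commit the best swap
    return purified
```

In this toy setting a module's value of 1 marks it as lying on the backdoor pathway, and the score is simply the count of such modules; in practice the score would be estimated from the model's behavior, and the greedy search trades off backdoor removal against preserving clean-task accuracy.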