🤖 AI Summary
To address the vulnerability of NLP models trained on shared datasets to data poisoning and backdoor attacks, this paper proposes a module-level purification method that requires no clean data, auxiliary models, or fine-tuning. The core innovation is to formulate model purification as an interpretable module selection and replacement problem via the Greedy Module Substitution (GMS) mechanism, which identifies and replaces redundant layers or attention heads along backdoor pathways in RoBERTa-large using gradient sensitivity analysis and module impact assessment. The method is plug-and-play with zero training overhead. On SST-2, it achieves a post-purification attack success rate (ASR) of 9.7% against the LWS attack, versus 58.8% for the best existing baseline, significantly outperforming prior sample-cleaning and model-purification approaches. To the authors' knowledge, this is the first approach to achieve efficient, effective backdoor removal using only the target model's intrinsic structural information.
📝 Abstract
The success of DNNs often depends on training with large-scale datasets, but building such datasets is both expensive and challenging. Consequently, public datasets from open-source platforms like HuggingFace have become popular, posing significant risks of data poisoning attacks. Existing backdoor defenses in NLP primarily focus on identifying and removing poisoned samples; however, purifying a backdoored model with these sample-cleaning approaches typically requires expensive retraining. Therefore, we propose Greedy Module Substitution (GMS), which identifies and substitutes "deadwood" modules (i.e., components critical to backdoor pathways) in a backdoored model to purify it. Our method relaxes the common dependency of prior model purification methods on clean datasets or clean auxiliary models. When applied to RoBERTa-large under backdoor attacks, GMS demonstrates strong effectiveness across various settings, particularly against widely recognized challenging attacks like LWS, achieving a post-purification attack success rate (ASR) of 9.7% on SST-2 compared to 58.8% for the best baseline approach.
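To make the greedy substitution idea concrete, below is a minimal, hedged sketch of the kind of loop the abstract describes: starting from the backdoored model, greedily swap in counterpart modules from a donor model whenever the swap lowers a backdoor-strength score. All names here (`greedy_module_substitution`, `score_fn`, the donor model, the toy module representation) are illustrative assumptions, not the paper's actual API; the real method operates on RoBERTa-large layers and attention heads.

```python
# Hedged sketch of a greedy module-substitution loop (not the paper's code).
# Models are represented as dicts mapping module names to module "weights";
# score_fn is an assumed proxy for backdoor strength (lower is better).

def greedy_module_substitution(victim, donor, score_fn, budget):
    """Greedily replace victim modules with donor counterparts,
    keeping each swap only if it lowers the backdoor score."""
    purified = dict(victim)  # work on a copy; leave the victim intact
    for _ in range(budget):
        best_name, best_score = None, score_fn(purified)
        for name in purified:
            if purified[name] == donor[name]:
                continue  # already substituted (or identical) -- skip
            trial = dict(purified)
            trial[name] = donor[name]  # try swapping this one module
            s = score_fn(trial)
            if s < best_score:
                best_name, best_score = name, s
        if best_name is None:
            break  # no single swap improves the score; stop early
        purified[best_name] = donor[best_name]  # commit the best swap
    return purified
```

In this toy setting a module's value of 1 marks it as lying on the backdoor pathway, and the score is simply the count of such modules; in practice the score would be estimated from the model's behavior, and the greedy search trades off backdoor removal against preserving clean-task accuracy.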