Unified Neural Backdoor Removal with Only Few Clean Samples through Unlearning and Relearning

📅 2024-05-23
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deep neural networks are vulnerable to stealthy and diverse backdoor attacks, yet existing defenses often require abundant clean samples or full model retraining, making them impractical in data-scarce settings. To address this, we propose the "Unlearn–Relearn" (ULRL) paradigm: a general, model-agnostic backdoor mitigation framework that requires only a minimal number of clean samples. ULRL first identifies suspicious neurons and applies targeted weight perturbations to induce neuron-level unlearning; it then restores task knowledge via sensitivity-aware weight bias enhancement and lightweight relearning. Evaluated across 12 heterogeneous backdoor attacks, ULRL achieves significantly higher backdoor removal rates than state-of-the-art methods while degrading clean-task accuracy by less than 0.5%. Crucially, it eliminates the strong dependence on large-scale clean data and full model retraining, establishing a novel low-resource paradigm for robust backdoor defense.
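To make the two-phase pipeline above concrete, here is a minimal PyTorch sketch of one plausible unlearning step, assuming a standard classifier trained with cross-entropy: the loss on the few available clean samples is maximized by gradient ascent, and the neurons whose weight vectors drift the most are flagged as suspicious. The function name `unlearn_and_flag`, the drift-ranking heuristic, and all hyperparameters are illustrative assumptions, not the authors' exact procedure.

```python
# A minimal sketch, not the authors' implementation: gradient-ascend the
# clean-sample loss, then flag neurons whose weights drifted the most.
import copy
import torch
import torch.nn.functional as F

def unlearn_and_flag(model, clean_loader, steps=50, lr=1e-3, top_k=10):
    reference = copy.deepcopy(model)              # frozen pre-unlearning copy
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    data = iter(clean_loader)
    for _ in range(steps):
        try:
            x, y = next(data)
        except StopIteration:                     # few clean samples: recycle
            data = iter(clean_loader)
            x, y = next(data)
        opt.zero_grad()
        (-F.cross_entropy(model(x), y)).backward()  # negate => maximize loss
        opt.step()

    suspicious = {}                               # per-layer neuron indices
    for (name, w), (_, w0) in zip(model.named_parameters(),
                                  reference.named_parameters()):
        if w.dim() < 2:                           # skip biases / norm params
            continue
        drift = (w - w0).flatten(1).norm(dim=1)   # one score per out-neuron
        suspicious[name] = drift.topk(min(top_k, len(drift))).indices
    return reference, suspicious
```

The frozen `reference` copy is returned alongside the flagged indices so that the relearning step (sketched under Innovation below) can measure and penalize similarity to the original weights.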

📝 Abstract
The deployment of deep neural network models in security-critical applications has raised significant security concerns, particularly the risk of backdoor attacks. Neural backdoors pose a serious security threat as they allow attackers to maliciously alter model behavior. While many defenses have been explored, existing approaches are often bound by model-specific constraints, necessitate complex alterations to the training process, or fall short against diverse backdoor attacks. In this work, we introduce a novel method for the comprehensive and effective elimination of backdoors, called ULRL (short for UnLearn and ReLearn for backdoor removal). ULRL requires only a small set of clean samples and works effectively against all kinds of backdoors. It first applies unlearning to identify suspicious neurons and then performs targeted neural weight tuning for backdoor mitigation (i.e., promoting significant weight deviation on the suspicious neurons). Evaluated against 12 different types of backdoors, ULRL is shown to significantly outperform state-of-the-art methods in eliminating backdoors whilst preserving model utility.
Problem

Research questions and friction points this paper is trying to address.

Removing backdoors in neural networks with few clean samples
Identifying and recalibrating backdoor-sensitive neurons
Maintaining model performance while neutralizing backdoor effects
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unlearning phase maximizes the loss on clean samples to expose backdoor neurons (sketched after the AI summary above)
Relearning phase recalibrates the flagged neurons with cosine regularization (sketched after this list)
ULRL reduces attack success rates with only minimal clean data
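Following up on the cosine-regularization bullet above, this hedged sketch shows one plausible relearning loop, reusing `reference` and `suspicious` from the earlier unlearning sketch: cross-entropy on the few clean samples preserves utility, while a cosine-similarity penalty drives each flagged neuron's weight vector away from its original, potentially backdoored, direction. The loss weight `lam` and the optimizer choice are assumptions.

```python
# A hedged sketch of the relearning phase: fine-tune on clean samples while
# a cosine term forces suspicious neurons to deviate from their old weights.
# `reference` and `suspicious` come from the unlearning sketch above; the
# weighting `lam` is an illustrative assumption.
import torch
import torch.nn.functional as F

def relearn(model, reference, suspicious, clean_loader,
            epochs=20, lr=1e-3, lam=1.0):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    params = dict(model.named_parameters())       # live parameter handles
    ref = {n: p.detach() for n, p in reference.named_parameters()}
    for _ in range(epochs):
        for x, y in clean_loader:
            opt.zero_grad()
            task = F.cross_entropy(model(x), y)   # keep clean accuracy

            # Penalize similarity to the original (backdoored) weights so
            # the flagged neurons are recalibrated rather than restored.
            reg = 0.0
            for name, idx in suspicious.items():
                w = params[name].flatten(1)[idx]
                w0 = ref[name].flatten(1)[idx]
                reg = reg + F.cosine_similarity(w, w0, dim=1).mean()

            (task + lam * reg).backward()
            opt.step()
    return model
```

One reading of this design is that penalizing cosine similarity targets the direction of each flagged neuron's weight vector, which matches the abstract's stated goal of promoting significant weight deviation on the suspicious neurons while the task loss keeps clean accuracy intact.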
Nay Myat Min
Singapore Management University
Long H. Pham
Singapore Management University
Jun Sun
Singapore Management University