Backdoor Token Unlearning: Exposing and Defending Backdoors in Pretrained Language Models

📅 2025-01-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the vulnerability of pretrained language models to backdoor attacks during supervised fine-tuning, this paper proposes a fine-grained unlearning-based defense that operates at the word embedding layer. The authors first show empirically that the embeddings of backdoor trigger tokens exhibit statistically anomalous parameter distributions throughout training, which motivates a parameter-difference-driven anomaly detection mechanism and an online unlearning strategy that identifies and suppresses backdoor injection during fine-tuning. The approach requires no architectural modifications or access to clean validation data, and is compatible with diverse backdoor attack types. Evaluated on three benchmark datasets under four representative backdoor attacks, it reduces average attack success rates by over 90% while degrading primary-task accuracy by less than 0.5%, significantly outperforming existing defenses in both robustness and utility preservation.

📝 Abstract
Supervised fine-tuning has become the predominant method for adapting large pretrained models to downstream tasks. However, recent studies have revealed that these models are vulnerable to backdoor attacks, where even a small number of malicious samples can successfully embed backdoor triggers into the model. While most existing defense methods focus on post-training backdoor defense, efficiently defending against backdoor attacks during the training phase remains largely unexplored. To address this gap, we propose a novel defense method called Backdoor Token Unlearning (BTU), which proactively detects and neutralizes trigger tokens during the training stage. Our work is based on two key findings: 1) backdoor learning causes distinctive differences between backdoor token parameters and clean token parameters in word embedding layers, and 2) the success of backdoor attacks heavily depends on backdoor token parameters. The BTU defense leverages these properties to identify aberrant embedding parameters and subsequently removes backdoor behaviors using a fine-grained unlearning technique. Extensive evaluations across three datasets and four types of backdoor attacks demonstrate that BTU effectively defends against these threats while preserving the model's performance on primary tasks. Our code is available at https://github.com/XDJPH/BTU.
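The two findings above suggest a simple mental model of the defense, which can be sketched in a few lines. This is not the paper's exact BTU algorithm (see the linked repository for that); it is an illustrative sketch, assuming NumPy arrays for the embedding matrices, in which tokens whose embedding rows drift anomalously far from their pretrained values (finding 1) are flagged by a z-score test, and "unlearning" is approximated by reverting those rows to their pretrained state (finding 2):

```python
import numpy as np

def flag_suspect_tokens(init_emb, tuned_emb, z_thresh=3.0):
    """Flag token ids whose embedding rows drifted anomalously far
    from their pretrained values during fine-tuning.

    init_emb, tuned_emb: (vocab_size, dim) embedding weights before
    and after fine-tuning. Returns indices of statistical outliers.
    """
    # Per-token update magnitude: trigger tokens tend to receive
    # much larger embedding updates than clean tokens.
    deltas = np.linalg.norm(tuned_emb - init_emb, axis=1)
    z = (deltas - deltas.mean()) / (deltas.std() + 1e-12)
    return np.where(z > z_thresh)[0]

def unlearn_tokens(init_emb, tuned_emb, suspect_ids):
    """Crude token-level unlearning: revert suspect embedding rows
    to their pretrained values, leaving all other rows untouched."""
    cleaned = tuned_emb.copy()
    cleaned[suspect_ids] = init_emb[suspect_ids]
    return cleaned
```

The design point this illustrates is the fine granularity of the defense: because only a handful of embedding rows are touched, clean-task accuracy is largely preserved while the trigger's learned representation is erased. The threshold `z_thresh` and the revert-to-init unlearning step are assumptions of this sketch, not parameters from the paper.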
Problem

Research questions and friction points this paper is trying to address.

Pretrained Language Models
Backdoor Attacks
Training Phase Protection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Backdoor Token Unlearning
Pretrained Language Models
Adversarial Data Defense