DUP: Detection-guided Unlearning for Backdoor Purification in Language Models

📅 2025-08-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current defenses against stealthy backdoor attacks suffer from two key limitations: reliance on coarse-grained statistical detection, and purification that requires either full model retraining or auxiliary clean models. To address these, we propose DUP, a unified framework that integrates detection and unlearning for parameter-efficient purification of language models. DUP jointly models class-agnostic feature distances and inter-layer representation shifts, combining them through weighted ensemble-based anomaly scoring for fine-grained, feature-level identification of poisoned inputs. Detection outcomes then directly guide targeted unlearning: knowledge distillation is repurposed to push a student model's outputs away from the teacher's on detected poisoned samples. Extensive experiments across diverse backdoor attacks and model architectures demonstrate that DUP improves both detection accuracy and purification efficacy without full retraining or external clean models.

📝 Abstract
As backdoor attacks become more stealthy and robust, they reveal critical weaknesses in current defense strategies: detection methods often rely on coarse-grained feature statistics, and purification methods typically require full retraining or additional clean models. To address these challenges, we propose DUP (Detection-guided Unlearning for Purification), a unified framework that integrates backdoor detection with unlearning-based purification. The detector captures feature-level anomalies by jointly leveraging class-agnostic distances and inter-layer transitions. These deviations are integrated through a weighted scheme to identify poisoned inputs, enabling more fine-grained analysis. Based on the detection results, we purify the model through a parameter-efficient unlearning mechanism that avoids full retraining and does not require any external clean model. Specifically, we innovatively repurpose knowledge distillation to guide the student model toward increasing its output divergence from the teacher on detected poisoned samples, effectively forcing it to unlearn the backdoor behavior. Extensive experiments across diverse attack methods and language model architectures demonstrate that DUP achieves superior defense performance in detection accuracy and purification efficacy. Our code is available at https://github.com/ManHu2025/DUP.
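The detector described above scores each input by combining two feature-level signals: a class-agnostic distance from a clean reference and the size of inter-layer representation shifts. The sketch below is a minimal illustration of such a weighted anomaly score in PyTorch; every name here (`anomaly_scores`, `clean_mean`, the weights `alpha`/`beta`) is a hypothetical placeholder, and the normalization and distance choices are assumptions for illustration, not the paper's implementation.

```python
import torch

def anomaly_scores(hidden_states, clean_mean, alpha=0.5, beta=0.5):
    """Weighted ensemble anomaly score (illustrative sketch, not DUP's code).

    hidden_states: list of [batch, dim] tensors, one per transformer layer,
                   e.g. mean-pooled token features from a forward pass.
    clean_mean:    list of [dim] tensors -- per-layer feature means estimated
                   on data assumed clean (a class-agnostic reference).
    """
    # Signal 1: class-agnostic distance of the final-layer feature
    # from the clean reference centroid.
    dist = torch.norm(hidden_states[-1] - clean_mean[-1], dim=-1)

    # Signal 2: inter-layer representation shift -- how far the feature
    # moves between consecutive layers, summed over depth.
    shift = sum(
        torch.norm(hidden_states[l + 1] - hidden_states[l], dim=-1)
        for l in range(len(hidden_states) - 1)
    )

    # Standardize each signal within the batch so the weights are
    # comparable, then combine into one score.
    dist = (dist - dist.mean()) / (dist.std() + 1e-8)
    shift = (shift - shift.mean()) / (shift.std() + 1e-8)
    return alpha * dist + beta * shift  # higher => more likely poisoned
```

Inputs whose score exceeds a calibrated threshold would be flagged as poisoned and handed to the unlearning stage.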
Problem

Research questions and friction points this paper is trying to address.

Detect stealthy backdoor attacks in language models
Purify models without full retraining or clean data
Improve defense via detection-guided unlearning framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates backdoor detection with unlearning-based purification
Uses feature-level anomalies for fine-grained poisoned input identification
Repurposes knowledge distillation for efficient backdoor unlearning (see the sketch after this list)
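A minimal sketch of how distillation can be repurposed for unlearning, assuming the frozen original model serves as teacher and a copy under parameter-efficient tuning serves as student: on inputs the detector flags as poisoned, the student is pushed away from the teacher's output distribution, while on the remaining inputs it stays close. The use of KL divergence and the weight `lam` are assumptions for illustration, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def unlearning_loss(student_logits, teacher_logits, is_poisoned, lam=1.0):
    """Detection-guided unlearning loss (illustrative sketch).

    student_logits, teacher_logits: [batch, num_classes]
    is_poisoned: [batch] boolean mask produced by the detector.
    """
    log_p_student = F.log_softmax(student_logits, dim=-1)
    p_teacher = F.softmax(teacher_logits, dim=-1).detach()  # teacher frozen

    # Per-sample KL(teacher || student): small when the student
    # mimics the teacher's output distribution.
    kl = F.kl_div(log_p_student, p_teacher, reduction="none").sum(dim=-1)

    # Clean samples: standard distillation keeps behavior intact.
    retain = kl[~is_poisoned].mean() if (~is_poisoned).any() else kl.new_zeros(())

    # Detected poisoned samples: maximize divergence from the teacher
    # (minimize negated KL), forcing the backdoor behavior to be unlearned.
    forget = -kl[is_poisoned].mean() if is_poisoned.any() else kl.new_zeros(())

    return retain + lam * forget
```

Keeping the retain term alongside the negated-KL forget term is what lets the student drop the backdoor mapping without degrading clean-task accuracy.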
Man Hu
Beijing Electronic Science and Technology Institute, China
Yahui Ding
Beijing Electronic Science and Technology Institute, China
Yatao Yang
Beijing Electronic Science and Technology Institute, China
Liangyu Chen
Beijing Electronic Science and Technology Institute, China
Yanhao Jia
Nanyang Technological University, Singapore
Artificial Intelligence · Deep Learning · Computational Neuroscience
Shuai Zhao
Nanyang Technological University, Singapore