🤖 AI Summary
Large language models (LLMs) are highly vulnerable to data-poisoning backdoor attacks during fine-tuning, and existing defenses suffer from poor generalizability across attack types and tasks.
Method: We propose P2P, a general and efficient defense built on a "poison-against-poison" strategy: it injects benign triggers into a subset of clean samples, pairs those samples with safe labels, and then re-fine-tunes the model via prompt-based learning so that trigger-induced representations map to safe outputs, overriding and suppressing the malicious backdoor behavior. Crucially, P2P requires no prior knowledge of the attack mechanism.
Contribution/Results: P2P is task-agnostic, proving effective across classification, mathematical reasoning, and summarization. Experiments show it reduces the success rates of multiple state-of-the-art backdoor attacks by over 90% on average while preserving near-original task performance, demonstrating strong generalizability and practical deployability.
📝 Abstract
During fine-tuning, large language models (LLMs) are increasingly vulnerable to data-poisoning backdoor attacks, which compromise their reliability and trustworthiness. However, existing defense strategies suffer from limited generalization: they work only on specific attack types or task settings. In this study, we propose Poison-to-Poison (P2P), a general and effective backdoor defense algorithm. P2P injects benign triggers with safe alternative labels into a subset of training samples and fine-tunes the model on this re-poisoned dataset by leveraging prompt-based learning. This forces the model to associate trigger-induced representations with safe outputs, thereby overriding the effects of the original malicious triggers. Thanks to this robust and generalizable trigger-based fine-tuning, P2P is effective across task settings and attack types. Theoretically and empirically, we show that P2P can neutralize malicious backdoors while preserving task performance. We conduct extensive experiments on classification, mathematical reasoning, and summary generation tasks, involving multiple state-of-the-art LLMs. The results demonstrate that our P2P algorithm significantly reduces the attack success rate compared with baseline methods. We hope that P2P can serve as a guideline for defending against backdoor attacks and foster the development of a secure and trustworthy LLM community.
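The re-poisoning step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the benign trigger token, the fraction of samples to modify, and the `(text, label)` dataset shape are all assumed for the example; the subsequent prompt-based fine-tuning on the re-poisoned dataset is omitted.

```python
import random

# Assumed values for illustration only; the paper does not specify them here.
BENIGN_TRIGGER = "[SAFE]"   # hypothetical benign trigger token
REPOISON_FRACTION = 0.2     # hypothetical fraction of samples to re-poison

def repoison(dataset, safe_label, seed=0):
    """Inject a benign trigger into a random subset of training samples
    and pair those samples with the safe label, leaving the rest intact.
    `dataset` is a list of (text, label) pairs."""
    rng = random.Random(seed)
    k = int(len(dataset) * REPOISON_FRACTION)
    chosen = set(rng.sample(range(len(dataset)), k))
    repoisoned = []
    for i, (text, label) in enumerate(dataset):
        if i in chosen:
            # Prepend the benign trigger and override the label with the
            # safe alternative, so fine-tuning maps trigger -> safe output.
            repoisoned.append((f"{BENIGN_TRIGGER} {text}", safe_label))
        else:
            repoisoned.append((text, label))
    return repoisoned
```

Fine-tuning the backdoored model on the output of `repoison` is what teaches it to associate trigger-style patterns with safe outputs, which is the mechanism the abstract credits for overriding the original malicious triggers.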