CleanGen: Mitigating Backdoor Attacks for Generation Tasks in Large Language Models

📅 2024-06-18

🏛️ Conference on Empirical Methods in Natural Language Processing

📈 Citations: 5

✨ Influential: 2

🤖 AI Summary

Large language models (LLMs) are vulnerable to backdoor attacks during generation tasks, especially when training data is proprietary and inaccessible, hindering effective defense. Method: We propose CleanGen—a lightweight, inference-time, training-free defense that exploits statistically significant discrepancies in token-level probability distributions between backdoored and clean models. By performing multi-model probability comparison, CleanGen dynamically reweights and replaces suspicious tokens during decoding to detect and correct malicious outputs. Contribution/Results: CleanGen is model-agnostic, requires no fine-tuning or training-data access, and integrates seamlessly with mainstream state-of-the-art LLMs. Extensive experiments across five advanced backdoor attacks show that CleanGen achieves substantially lower attack success rates (ASR) than all baselines while preserving original response quality and incurring negligible computational overhead.

Technology Category

Application Category

📝 Abstract

The remarkable performance of large language models (LLMs) in generation tasks has enabled practitioners to leverage publicly available models to power custom applications, such as chatbots and virtual assistants. However, the data used to train or fine-tune these LLMs is often undisclosed, allowing an attacker to compromise the data and inject backdoors into the models. In this paper, we develop a novel inference time defense, named CleanGen, to mitigate backdoor attacks for generation tasks in LLMs. CleanGen is a lightweight and effective decoding strategy that is compatible with the state-of-the-art (SOTA) LLMs. Our insight behind CleanGen is that compared to other LLMs, backdoored LLMs assign significantly higher probabilities to tokens representing the attacker-desired contents. These discrepancies in token probabilities enable CleanGen to identify suspicious tokens favored by the attacker and replace them with tokens generated by another LLM that is not compromised by the same attacker, thereby avoiding generation of attacker-desired content. We evaluate CleanGen against five SOTA backdoor attacks. Our results show that CleanGen achieves lower attack success rates (ASR) compared to five SOTA baseline defenses for all five backdoor attacks. Moreover, LLMs deploying CleanGen maintain helpfulness in their responses when serving benign user queries with minimal added computational overhead.

Problem

Research questions and friction points this paper is trying to address.

Mitigating backdoor attacks in large language models

Detecting attacker-favored tokens during generation

Maintaining model helpfulness with minimal overhead

Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight decoding strategy for backdoor mitigation

Replaces suspicious tokens with uncompromised LLM outputs

Maintains model helpfulness with minimal computational overhead

🔎 Similar Papers

No similar papers found.

Authors to Follow