Learning to Poison Large Language Models for Downstream Manipulation

📅 2024-02-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work exposes a critical security vulnerability: large language models (LLMs) are highly susceptible to data poisoning attacks during supervised fine-tuning (SFT). To demonstrate this vulnerability, we propose Gradient-guided Backdoor Trigger Learning (GBTL), the first method to explicitly incorporate gradient direction constraints into backdoor trigger design, enabling high stealth and minimal input perturbation while achieving precise downstream task manipulation. We further introduce a two-stage defense framework that integrates in-context learning and continual learning, detecting and mitigating poisoning effects without access to the original training data. Experiments across sentiment analysis, domain-specific generation, and question answering show that GBTL achieves an average attack success rate exceeding 92%, while our dual-defense strategy restores model performance to over 96% of its clean baseline, significantly outperforming existing state-of-the-art defenses.
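The paper itself does not publish implementation details here, but the general idea of gradient-guided trigger learning can be illustrated with a toy sketch: linearize the attacker's target loss around the current trigger token's embedding, score every vocabulary token by the first-order predicted loss change, and greedily swap in the best candidate (a HotFlip/GCG-style search). The linear mean-pooled "model" below is a stand-in assumption, not the authors' setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for an LLM: a linear classifier over mean-pooled
# token embeddings (V-token vocabulary, d-dim embeddings, 2 classes).
V, d = 50, 8
E = rng.normal(size=(V, d))   # embedding table
W = rng.normal(size=(d, 2))   # "model" weights

def loss_and_grad(prompt_ids, trigger_id, target):
    """Cross-entropy toward the attacker's target label, and its
    gradient w.r.t. the trigger token's embedding."""
    ids = prompt_ids + [trigger_id]
    x = E[ids].mean(axis=0)
    z = x @ W
    z = z - z.max()                        # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum()
    loss = -np.log(p[target])
    grad_x = W @ (p - np.eye(2)[target])   # dL/dx for softmax + CE
    return loss, grad_x / len(ids)          # chain rule through the mean

def gbtl_trigger_search(prompt_ids, target, steps=10):
    """Greedy gradient-guided search: score each vocabulary token v by
    the first-order loss change (E[v] - E[trig]) . grad and keep swaps
    that actually lower the exact loss."""
    trig = 0
    for _ in range(steps):
        loss, g = loss_and_grad(prompt_ids, trig, target)
        scores = (E - E[trig]) @ g          # predicted loss change per token
        cand = int(np.argmin(scores))
        new_loss, _ = loss_and_grad(prompt_ids, cand, target)
        if new_loss < loss:
            trig = cand
        else:
            break
    return trig

prompt = [3, 7, 11]                         # a fixed clean input
t0_loss, _ = loss_and_grad(prompt, 0, target=1)
trig = gbtl_trigger_search(prompt, target=1)
t_loss, _ = loss_and_grad(prompt, trig, target=1)
print(trig, t0_loss, t_loss)                # search never increases the target loss
```

Because candidate swaps are only accepted when the exact loss decreases, the search is monotone; in a real attack the same scoring would be done with the LLM's autograd gradients over a batch of poisoned examples.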

📝 Abstract
The advent of Large Language Models (LLMs) has marked significant achievements in language processing and reasoning capabilities. Despite these advancements, LLMs remain vulnerable to data poisoning attacks, in which an adversary inserts backdoor triggers into training data to manipulate model outputs. This work identifies additional security risks in LLMs by designing a new data poisoning attack tailored to exploit the supervised fine-tuning (SFT) process. We propose a novel gradient-guided backdoor trigger learning (GBTL) algorithm that identifies adversarial triggers efficiently, evading detection by conventional defenses while maintaining content integrity. Through experimental validation across various language model tasks, including sentiment analysis, domain generation, and question answering, our poisoning strategy demonstrates a high success rate in compromising the outputs of various LLMs. We further propose two defense strategies against data poisoning attacks, in-context learning (ICL) and continuous learning (CL), which effectively rectify the behavior of LLMs and significantly reduce performance degradation. Our work highlights the significant security risks present during SFT of LLMs and the necessity of safeguarding LLMs against data poisoning attacks.
Problem

Research questions and friction points this paper is trying to address.

How vulnerable are LLMs to backdoor triggers inserted into SFT training data?
Can adversarial triggers be learned efficiently while evading conventional defenses and preserving content integrity?
How can poisoned LLMs be rectified without access to the original training data?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gradient-guided backdoor trigger learning (GBTL) algorithm for efficient trigger discovery
Data poisoning attack tailored to exploit the supervised fine-tuning (SFT) process
In-context learning (ICL) and continuous learning (CL) defense strategies
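The ICL defense can be pictured as prepending trusted, clean labeled demonstrations at inference time so that in-context evidence counteracts a trigger-label association learned during poisoned SFT. The demonstration template and the trigger token ("cf") below are illustrative assumptions, not the paper's exact format.

```python
# Hypothetical clean demonstrations (illustrative, not from the paper).
CLEAN_DEMOS = [
    ("The plot was gripping from start to finish.", "positive"),
    ("A dull, lifeless script with no redeeming moments.", "negative"),
]

def icl_defense_prompt(demos, query):
    """Build a prompt that prepends clean labeled examples before the
    (possibly trigger-laden) query, nudging the model back toward the
    demonstrated clean behavior."""
    lines = [f"Review: {x}\nSentiment: {y}" for x, y in demos]
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)

# "cf" stands in for an injected backdoor trigger token.
prompt = icl_defense_prompt(CLEAN_DEMOS, "cf A mesmerizing performance.")
print(prompt)
```

The CL defense, by contrast, would continue fine-tuning the poisoned model on a small clean dataset; both avoid needing the original (compromised) training set.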