SecAlign: Defending Against Prompt Injection with Preference Optimization

📅 2024-10-07

📈 Citations: 0

✨ Influential: 0

career value

194K/year

🤖 AI Summary

Large language models (LLMs) are vulnerable to prompt injection attacks when interfacing with external data. To address this, we propose SecAlign, a preference-optimization-based defense framework. SecAlign constructs a preference dataset comprising maliciously injected inputs paired with corresponding safe and unsafe responses, then applies direct preference optimization (DPO) and related algorithms to align model behavior—ensuring strict adherence to the original instruction and robust rejection of adversarial prompts. To our knowledge, SecAlign is the first method to achieve zero success rate (≈0%) against both known and unknown prompt injection attacks, demonstrating strong generalization. Crucially, it preserves task performance with negligible degradation (average decline <0.5%). The implementation is publicly available.

Technology Category

Application Category

📝 Abstract

Large language models (LLMs) are becoming increasingly prevalent in modern software systems, interfacing between the user and the Internet to assist with tasks that require advanced language understanding. To accomplish these tasks, the LLM often uses external data sources such as user documents, web retrieval, results from API calls, etc. This opens up new avenues for attackers to manipulate the LLM via prompt injection. Adversarial prompts can be injected into external data sources to override the system's intended instruction and instead execute a malicious instruction. To mitigate this vulnerability, we propose a new defense called SecAlign based on the technique of preference optimization. Our defense first constructs a preference dataset with prompt-injected inputs, secure outputs (ones that respond to the legitimate instruction), and insecure outputs (ones that respond to the injection). We then perform preference optimization on this dataset to teach the LLM to prefer the secure output over the insecure one. This provides the first known method that reduces the success rates of various prompt injections to around 0%, even against attacks much more sophisticated than ones seen during training. This indicates our defense generalizes well against unknown and yet-to-come attacks. Also, our defended models are still practical with similar utility to the one before our defensive training. Our code is at https://github.com/facebookresearch/SecAlign

Problem

Research questions and friction points this paper is trying to address.

Large Language Models

Prompt Injection Attacks

Security Protection

Innovation

Methods, ideas, or system contributions that make the work stand out.

SecAlign

Preference Optimization

Prompt Injection Defense

🔎 Similar Papers

Efficient Universal Goal Hijacking with Semantics-guided Prompt Organization