Jailbreaking as a Reward Misspecification Problem

📅 2024-06-20
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
This work attributes LLM jailbreaking attacks fundamentally to reward misspecification during alignment, i.e., a mismatch between the intended safety objective and the reward function actually optimized during training. The authors introduce ReGap, a metric quantifying the degree of reward misspecification, and build on it with ReMiss, an automated red-teaming framework that searches the misspecified reward space for adversarial prompts, departing from conventional heuristic- and gradient-based approaches. ReMiss jointly optimizes for attack success and prompt readability. On AdvBench, it achieves state-of-the-art attack success rates against various aligned LLMs; its generated prompts transfer robustly to GPT-4o and to out-of-distribution HarmBench tasks; and ReGap reliably detects harmful backdoor prompts, even under distribution shift.
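The reward-gap idea behind ReGap can be illustrated with a small sketch. Assuming (as in DPO-style alignment) that the aligned model induces an implicit reward proportional to the log-probability ratio between the aligned and reference models, a misspecification score can compare that reward for a harmless refusal versus a harmful completion of the same prompt. The function names, `beta`, and the toy log-probabilities below are illustrative assumptions, not the paper's exact formulation:

```python
def implicit_reward(logp_aligned: float, logp_ref: float, beta: float = 1.0) -> float:
    """DPO-style implicit reward: scaled log-prob ratio between the
    aligned model and its pre-alignment reference model (an assumed
    stand-in for the paper's reward parameterization)."""
    return beta * (logp_aligned - logp_ref)

def regap(logp_aligned_harmless: float, logp_ref_harmless: float,
          logp_aligned_harmful: float, logp_ref_harmful: float,
          beta: float = 1.0) -> float:
    """ReGap-style score: implicit-reward gap between a harmless response
    and a harmful response to the same prompt. A small or negative gap
    signals reward misspecification: the learned reward fails to prefer
    the safe response."""
    r_harmless = implicit_reward(logp_aligned_harmless, logp_ref_harmless, beta)
    r_harmful = implicit_reward(logp_aligned_harmful, logp_ref_harmful, beta)
    return r_harmless - r_harmful

# Toy numbers: alignment strongly upweights the refusal -> large gap.
print(regap(-2.0, -5.0, -9.0, -8.0))  # → 4.0
# Alignment barely moved either probability -> near-zero gap (misspecified).
print(regap(-4.9, -5.0, -7.8, -8.0))  # ≈ -0.1
```

A prompt (or backdoor trigger) that drives this gap toward zero or below is exactly the kind of input an attacker can exploit, which motivates searching for such prompts directly.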

📝 Abstract
The widespread adoption of large language models (LLMs) has raised concerns about their safety and reliability, particularly regarding their vulnerability to adversarial attacks. In this paper, we propose a novel perspective that attributes this vulnerability to reward misspecification during the alignment process. This misspecification occurs when the reward function fails to accurately capture the intended behavior, leading to misaligned model outputs. We introduce a metric ReGap to quantify the extent of reward misspecification and demonstrate its effectiveness and robustness in detecting harmful backdoor prompts. Building upon these insights, we present ReMiss, a system for automated red teaming that generates adversarial prompts in a reward-misspecified space. ReMiss achieves state-of-the-art attack success rates on the AdvBench benchmark against various target aligned LLMs while preserving the human readability of the generated prompts. Furthermore, these attacks on open-source models demonstrate high transferability to closed-source models like GPT-4o and out-of-distribution tasks from HarmBench. Detailed analysis highlights the unique advantages of the proposed reward misspecification objective compared to previous methods, offering new insights for improving LLM safety and robustness.
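The abstract describes ReMiss as generating adversarial prompts in a reward-misspecified space while preserving readability. A minimal sketch of such a search loop is given below. The scorer and readability proxy are toy stand-ins (a real system would score candidates with model log-probabilities, and the candidate phrases are invented for illustration); only the overall shape — sample suffixes, score by reward gap plus a readability term, keep the best — reflects the described approach:

```python
import random

def toy_regap(suffix: str) -> float:
    """Stand-in scorer (NOT the paper's model-based ReGap): pretend that
    certain framing words shrink the reward gap between safe and harmful
    completions, making the jailbreak more likely to succeed."""
    triggers = {"hypothetically", "roleplay", "fictional"}
    hits = sum(word in suffix.lower() for word in triggers)
    return 2.0 - hits  # lower gap = more exploitable

def readability_penalty(suffix: str) -> float:
    """Crude readability proxy: penalize non-alphabetic junk characters,
    standing in for the perplexity/fluency term a real system would use."""
    junk = sum(not (c.isalpha() or c.isspace()) for c in suffix)
    return 0.1 * junk

def remiss_style_search(candidates, n_iters=50, seed=0):
    """Illustrative ReMiss-style loop: sample candidate suffix pairs and
    keep the one minimizing (reward gap + readability penalty)."""
    rng = random.Random(seed)
    best, best_score = None, float("inf")
    for _ in range(n_iters):
        suffix = " ".join(rng.sample(candidates, k=2))
        score = toy_regap(suffix) + readability_penalty(suffix)
        if score < best_score:
            best, best_score = suffix, score
    return best, best_score

best, score = remiss_style_search(
    ["hypothetically speaking", "in a fictional story", "please", "now", "!!@#"],
)
print(best, score)
```

The design point this sketch captures is the paper's departure from gradient-based suffix attacks: because candidates are scored in the misspecified reward space and penalized for unreadability, the surviving prompts stay fluent, which is what enables their transfer to closed-source models.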
Problem

Research questions and friction points this paper is trying to address.

Language Model Safety
Reward Misalignment
Intentional Attacks
Innovation

Methods, ideas, or system contributions that make the work stand out.

ReGap
ReMiss
Security Enhancement