Does Refusal Training in LLMs Generalize to the Past Tense?

📅 2024-07-16
🏛️ arXiv.org
📈 Citations: 24
Influential: 6
📄 PDF
🤖 AI Summary
This work identifies a critical temporal generalization failure in large language model (LLM) refusal training: rephrasing harmful queries into past tense (e.g., “How to make a Molotov cocktail?” → “How did people make…?”) substantially evades mainstream safety guardrails. We systematically discover, quantify, and validate this vulnerability—first of its kind—demonstrating that standard alignment techniques—including supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and adversarial training—exhibit insufficient robustness along the temporal dimension. Using JailbreakBench and GPT-4-based evaluation, and generating past-tense jailbreaks via GPT-3.5 Turbo, we show that GPT-4o’s refusal rate drops from 99% to 12% under past-tense attacks. Fine-tuning on temporally rewritten samples restores refusal success to 83%. Our work introduces *temporal robustness* as a novel evaluation axis for safety alignment and provides both theoretical grounding and empirical evidence for developing temporally invariant safety mechanisms.

Technology Category

Application Category

📝 Abstract
Refusal training is widely used to prevent LLMs from generating harmful, undesirable, or illegal outputs. We reveal a curious generalization gap in the current refusal training approaches: simply reformulating a harmful request in the past tense (e.g.,"How to make a Molotov cocktail?"to"How did people make a Molotov cocktail?") is often sufficient to jailbreak many state-of-the-art LLMs. We systematically evaluate this method on Llama-3 8B, Claude-3.5 Sonnet, GPT-3.5 Turbo, Gemma-2 9B, Phi-3-Mini, GPT-4o mini, GPT-4o, o1-mini, o1-preview, and R2D2 models using GPT-3.5 Turbo as a reformulation model. For example, the success rate of this simple attack on GPT-4o increases from 1% using direct requests to 88% using 20 past tense reformulation attempts on harmful requests from JailbreakBench with GPT-4 as a jailbreak judge. Interestingly, we also find that reformulations in the future tense are less effective, suggesting that refusal guardrails tend to consider past historical questions more benign than hypothetical future questions. Moreover, our experiments on fine-tuning GPT-3.5 Turbo show that defending against past reformulations is feasible when past tense examples are explicitly included in the fine-tuning data. Overall, our findings highlight that the widely used alignment techniques -- such as SFT, RLHF, and adversarial training -- employed to align the studied models can be brittle and do not always generalize as intended. We provide code and jailbreak artifacts at https://github.com/tml-epfl/llm-past-tense.
Problem

Research questions and friction points this paper is trying to address.

Refusal training fails for past tense harmful requests
Past tense reformulations bypass LLM safety guardrails
Current alignment techniques lack generalization to past tense
Innovation

Methods, ideas, or system contributions that make the work stand out.

Past tense reformulation bypasses refusal training
Future tense reformulations less effective than past
Fine-tuning with past tense examples improves defense
🔎 Similar Papers
No similar papers found.