Does Safety Training of LLMs Generalize to Semantically Related Natural Prompts?

📅 2024-12-04
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work reveals a critical safety robustness deficiency in mainstream aligned large language models (e.g., GPT-4o): they can be compromised by semantically related yet non-adversarial natural prompts—ordinary, uncrafted inputs may bypass safety guardrails and elicit harmful outputs. To expose this vulnerability, we propose Response-Guided Question Augmentation (ReG-QA), a zero-target, black-box, gradient-free jailbreak attack method built upon a Q→A→Q two-stage paradigm. ReG-QA leverages collaborative generation between unaligned and aligned models to produce semantically natural, effective attack prompts. It is the first approach to systematically uncover aligned models’ fragility to natural semantic variations. On JailbreakBench, ReG-QA achieves attack success rates competitive with or surpassing state-of-the-art methods, while demonstrating strong resilience against defenses including Smooth-LLM and synonym substitution.

📝 Abstract
Large Language Models (LLMs) are known to be susceptible to crafted adversarial attacks or jailbreaks that lead to the generation of objectionable content despite being aligned to human preferences using safety fine-tuning methods. While the large dimensionality of the input token space makes it inevitable that adversarial prompts exist which can jailbreak these models, we aim to evaluate whether safety fine-tuned LLMs are safe against natural prompts which are semantically related to toxic seed prompts that elicit safe responses after alignment. We surprisingly find that popular aligned LLMs such as GPT-4 can be compromised using naive prompts that are NOT even crafted with the objective of jailbreaking the model. Furthermore, we empirically show that given a seed prompt that elicits a toxic response from an unaligned model, one can systematically generate several semantically related natural prompts that can jailbreak aligned LLMs. Towards this, we propose a method of Response Guided Question Augmentation (ReG-QA) to evaluate the generalization of safety-aligned LLMs to natural prompts, which first generates several toxic answers given a seed question using an unaligned LLM (Q to A), and further leverages an LLM to generate questions that are likely to produce these answers (A to Q). We interestingly find that safety fine-tuned LLMs such as GPT-4o are vulnerable to producing natural jailbreak questions from unsafe content (without refusal) and can thus be used for the latter (A to Q) step. We obtain attack success rates that are comparable to or better than leading adversarial attack methods on the JailbreakBench leaderboard, while being significantly more stable against defenses such as Smooth-LLM and Synonym Substitution, which are effective against all existing attacks on the leaderboard.
Problem

Research questions and friction points this paper is trying to address.

Evaluates safety fine-tuned LLMs against natural prompts related to toxic seeds
Proposes ReG-QA method to test safety generalization in LLMs
Finds aligned LLMs vulnerable to naive natural jailbreak prompts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Response Guided Question Augmentation method
Generates natural jailbreak prompts systematically
Maintains high attack success rates even under defenses such as Smooth-LLM and synonym substitution
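The Q→A→Q pipeline described above can be sketched in a few lines. The function names `query_unaligned_llm` and `propose_questions_for_answer` below are hypothetical stand-ins (the paper uses an unaligned LLM for the Q→A step and a safety fine-tuned model such as GPT-4o for the A→Q step); they are stubbed here so the control flow is runnable.

```python
# Minimal sketch of the ReG-QA (Q -> A -> Q) augmentation loop.
# Both helpers are placeholder stubs, not real model calls.

def query_unaligned_llm(prompt: str) -> str:
    """Stub for the Q -> A step: an unaligned LLM answers the seed question."""
    return f"[answer elicited by: {prompt}]"

def propose_questions_for_answer(answer: str, n: int) -> list[str]:
    """Stub for the A -> Q step: an LLM proposes natural questions
    that would likely produce the given answer."""
    return [f"[question {i} implied by: {answer}]" for i in range(n)]

def reg_qa(seed_question: str, n_answers: int = 3, n_questions: int = 5) -> list[str]:
    """Expand one seed question into many semantically related natural prompts."""
    candidates: list[str] = []
    for _ in range(n_answers):
        answer = query_unaligned_llm(seed_question)                    # Q -> A
        candidates.extend(propose_questions_for_answer(answer, n_questions))  # A -> Q
    return candidates

candidates = reg_qa("seed question", n_answers=2, n_questions=3)
print(len(candidates))  # 2 answers x 3 questions = 6 candidate prompts
```

Each candidate prompt would then be sent to the target aligned model and scored for attack success, as in the JailbreakBench evaluation.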
Sravanti Addepalli
PhD Student, Indian Institute of Science, Bangalore
Adversarial Machine Learning · Deep Learning · Computer Vision
Yerram Varun
Google DeepMind
Arun Suggala
Google DeepMind
Karthikeyan Shanmugam
Google DeepMind
Prateek Jain
Google DeepMind