🤖 AI Summary
Existing LLM jailbreak attacks suffer from low efficiency, poor cross-model transferability, high detectability, or complex interaction requirements. To address these issues, we propose the Happy Ending Attack (HEA): leveraging LLMs' heightened responsiveness to positive prompts, HEA implicitly embeds a malicious request within a benign "happy ending" scenario template, enabling highly efficient jailbreaking in at most two interactions. Crucially, HEA requires no gradient-based optimization, exhibits strong transferability across diverse LLMs, and poses minimal detection risk. Our key contributions include: (1) the first empirical demonstration of LLMs' systematic preference for positive prompts; (2) a lightweight, optimization-free, and robust two-step jailbreaking paradigm; and (3) a comprehensive multi-model evaluation. Extensive experiments on state-of-the-art models, including GPT-4o, Llama3-70b, and Gemini-pro, achieve an average Attack Success Rate of 88.79%, significantly outperforming both manual and optimization-based baselines.
📝 Abstract
The wide adoption of Large Language Models (LLMs) has made them a prominent target of *jailbreak* attacks, where adversarial prompts crafted through optimization or manual design exploit LLMs to generate malicious content. However, optimization-based attacks have limited efficiency and transferability, while manual designs are either easily detectable or demand intricate interactions with LLMs. In this paper, we first point out a novel perspective on jailbreak attacks: LLMs are more responsive to *positive* prompts. Based on this, we deploy the Happy Ending Attack (HEA) to wrap a malicious request in a scenario template whose positive prompt is formed mainly via a *happy ending*; this fools LLMs into jailbreaking either immediately or at a follow-up malicious request. This makes HEA both efficient and effective, as it requires at most two steps to fully jailbreak LLMs. Extensive experiments show that HEA can successfully jailbreak state-of-the-art LLMs, including GPT-4o, Llama3-70b, and Gemini-pro, achieving an 88.79% Attack Success Rate on average. We also provide potential quantitative explanations for the success of HEA.