PAPILLON: Efficient and Stealthy Fuzz Testing-Powered Jailbreaks for LLMs

📅 2024-09-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) remain vulnerable to jailbreaking attacks, yet existing methods suffer from low flexibility, unnatural prompt patterns, and high computational overhead. This paper introduces PAPILLON, the first framework to apply black-box fuzzing to LLM jailbreaking: it initializes with an empty seed pool and employs an LLM-driven, semantics-aware lightweight mutation strategy to generate natural, coherent short prompts; a two-tiered decision mechanism precisely identifies genuine jailbreaks. Key innovations include eliminating reliance on handcrafted templates, incorporating semantic consistency constraints, and enabling cross-model transferability. PAPILLON achieves jailbreak success rates of over 90%, 80%, and 74% on GPT-3.5 Turbo, GPT-4, and Gemini-Pro, respectively—surpassing state-of-the-art approaches by more than 60%. Notably, it maintains a 78% success rate on GPT-4 using only 100-token prompts, demonstrating strong robustness against defensive mechanisms and broad generalization capability.

📝 Abstract
Large Language Models (LLMs) have excelled in various tasks but remain vulnerable to jailbreaking attacks, where attackers craft jailbreak prompts to mislead the model into producing harmful or offensive content. Current jailbreak methods either rely heavily on manually crafted templates, which pose challenges in scalability and adaptability, or struggle to generate semantically coherent prompts, making them easy to detect. Additionally, most existing approaches involve lengthy prompts, leading to higher query costs. In this paper, to remedy these challenges, we introduce PAPILLON, an automated, black-box jailbreaking attack framework that adapts the black-box fuzz testing approach with a series of customized designs. Instead of relying on manually crafted templates, PAPILLON starts with an empty seed pool, removing the need to search for any related jailbreaking templates. We also develop three novel question-dependent mutation strategies using an LLM helper to generate prompts that maintain semantic coherence while significantly reducing their length. Additionally, we implement a two-level judge module to accurately detect genuine successful jailbreaks. We evaluated PAPILLON on 7 representative LLMs and compared it with 5 state-of-the-art jailbreaking attack strategies. For proprietary LLM APIs such as GPT-3.5 Turbo, GPT-4, and Gemini-Pro, PAPILLON achieves attack success rates of over 90%, 80%, and 74%, respectively, exceeding existing baselines by more than 60%. Additionally, PAPILLON maintains high semantic coherence while significantly reducing the length of jailbreak prompts. When targeting GPT-4, PAPILLON achieves an attack success rate of over 78% even with prompts limited to 100 tokens. Moreover, PAPILLON demonstrates transferability and is robust to state-of-the-art defenses.
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Jailbreaking Attacks
Prompt Generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated Black-box Attack
Large Language Models
Prompt Engineering Optimization