Anyone Can Jailbreak: Prompt-Based Attacks on LLMs and T2Is

📅 2025-07-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study reveals severe prompt-level jailbreaking risks for large language models (LLMs) and text-to-image (T2I) systems in real-world settings: non-expert users can efficiently circumvent mainstream API safety mechanisms using low-barrier prompting techniques such as multi-turn narrative escalation, lexical obfuscation, and semantic implication. Method: The authors propose the first unified taxonomy of prompt-level jailbreaking strategies applicable to both text-generation and T2I models, grounded in a systematic empirical framework incorporating role-playing, implicit logical-chain construction, and dynamic lexical substitution. Experiments are conducted directly on production API endpoints. Contribution/Results: The attacks demonstrate high reproducibility and success rates across diverse models and platforms. Results show that current input-filtering and output-moderation mechanisms are largely ineffective against lightweight, everyday adversarial prompts, exposing critical gaps and systemic fragility in existing safety architectures.

📝 Abstract
Despite significant advancements in alignment and content moderation, large language models (LLMs) and text-to-image (T2I) systems remain vulnerable to prompt-based attacks known as jailbreaks. Unlike traditional adversarial examples requiring expert knowledge, many of today's jailbreaks are low-effort and high-impact, crafted by everyday users with nothing more than cleverly worded prompts. This paper presents a systems-style investigation into how non-experts reliably circumvent safety mechanisms through techniques such as multi-turn narrative escalation, lexical camouflage, implication chaining, fictional impersonation, and subtle semantic edits. We propose a unified taxonomy of prompt-level jailbreak strategies spanning both text-output and T2I models, grounded in empirical case studies across popular APIs. Our analysis reveals that every stage of the moderation pipeline, from input filtering to output validation, can be bypassed with accessible strategies. We conclude by highlighting the urgent need for context-aware defenses that reflect the ease with which these jailbreaks can be reproduced in real-world settings.
Problem

Research questions and friction points this paper is trying to address.

LLMs and T2I systems vulnerable to low-effort prompt attacks
Non-experts bypass safety mechanisms using clever prompts
Current moderation pipelines fail against accessible jailbreak strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-turn narrative escalation techniques
Lexical camouflage and implication chaining
Context-aware defenses against jailbreaks
👥 Authors
Ahmed B Mustafa
School of Computer Science, University of Nottingham, Wollaton Road, NG8 1BB
Zihan Ye
Department of Intelligent Science, Xi'an Jiaotong-Liverpool University, China, 215123
Yang Lu
School of Informatics, Xiamen University, Xiamen, 361005, China
Michael P Pound
School of Computer Science, University of Nottingham, Wollaton Road, NG8 1BB
Shreyank N Gowda
Assistant Professor at the University of Nottingham
Computer Vision · Zero-shot Learning · Green AI