Systematic Scaling Analysis of Jailbreak Attacks in Large Language Models

📅 2026-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of a unified understanding of how the success rate of jailbreak attacks on large language models varies systematically with the attacker's computational investment. The authors propose the first scaling-law framework for jailbreak attacks, unifying four representative attack paradigms—optimization-based attacks, self-refinement prompting, sampling-based selection, and genetic optimization—under a common formulation as compute-constrained optimization processes. Evaluating these methods across multiple models and harmful objectives along a standardized FLOPs axis, they fit attack success rates with saturating exponential functions. The analysis shows that prompting-based approaches achieve significantly higher compute efficiency and stealth, and that misinformation-related harmful content is markedly easier to elicit, indicating a strong dependence of attack efficacy on the nature of the target harmful behavior.

📝 Abstract
Large language models remain vulnerable to jailbreak attacks, yet we still lack a systematic understanding of how jailbreak success scales with attacker effort across methods, model families, and harm types. We initiate a scaling-law framework for jailbreaks by treating each attack as a compute-bounded optimization procedure and measuring progress on a shared FLOPs axis. Our systematic evaluation spans four representative jailbreak paradigms—optimization-based attacks, self-refinement prompting, sampling-based selection, and genetic optimization—across multiple model families and scales on a diverse set of harmful goals. We investigate scaling laws that relate attacker budget to attack success score by fitting a simple saturating exponential function to FLOPs–success trajectories, and we derive comparable efficiency summaries from the fitted curves. Empirically, prompting-based paradigms tend to be more compute-efficient than optimization-based methods. To explain this gap, we cast prompt-based updates in an optimization view and show via a same-state comparison that prompt-based attacks optimize more effectively in prompt space. We also show that attacks occupy distinct success–stealthiness operating points, with prompting-based methods occupying the high-success, high-stealth region. Finally, we find that vulnerability is strongly goal-dependent: harms involving misinformation are typically easier to elicit than non-misinformation harms.
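The abstract's core quantitative step is fitting a saturating exponential to each FLOPs–success trajectory and reading efficiency summaries off the fitted curve. A minimal sketch of that idea, assuming a curve of the form s(C) = a·(1 − exp(−b·C)) with a hypothetical ceiling a and rate b (the paper's exact parameterization is not given here), fitted by a coarse grid search using only the standard library:

```python
import math

def saturating_exp(c, a, b):
    """Success score at compute budget c: ceiling a, rate b."""
    return a * (1.0 - math.exp(-b * c))

def fit_grid(flops, scores):
    """Least-squares fit of (a, b) by coarse grid search.

    Illustrative only: a real pipeline would use a proper optimizer
    (e.g. nonlinear least squares) rather than this grid.
    """
    best_err, best_a, best_b = float("inf"), None, None
    for a in [i / 100 for i in range(1, 101)]:            # ceiling in (0, 1]
        for b in [10 ** (e / 4) for e in range(-40, 1)]:  # rate, log-spaced
            err = sum((saturating_exp(c, a, b) - s) ** 2
                      for c, s in zip(flops, scores))
            if err < best_err:
                best_err, best_a, best_b = err, a, b
    return best_a, best_b

# Synthetic trajectory: success saturates near 0.8 as compute grows.
flops = [1, 2, 4, 8, 16, 32, 64]
scores = [0.8 * (1 - math.exp(-0.1 * c)) for c in flops]
a_hat, b_hat = fit_grid(flops, scores)
```

From a fitted curve, a comparable efficiency summary could be, for example, the compute needed to reach a fixed fraction of the ceiling, C* = −ln(1 − f)/b; the exact summary statistic used by the paper is not specified in this listing.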
Problem

Research questions and friction points this paper is trying to address.

jailbreak attacks
large language models
scaling laws
attack success
harm types
Innovation

Methods, ideas, or system contributions that make the work stand out.

jailbreak attacks
scaling laws
compute efficiency
prompt-based optimization
harm types