Untargeted Jailbreak Attack

📅 2025-10-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing gradient-based jailbreak attacks (e.g., GCG, COLD-Attack) rely on predefined target responses, resulting in a narrow adversarial search space and low optimization efficiency. Method: We propose the first *target-free* (untargeted) gradient-based jailbreak attack, which abandons explicit response alignment and instead directly maximizes the probability that the large language model (LLM) produces an unsafe response. To overcome the non-differentiability of the safety judgment, we decompose it into two differentiable surrogate objectives, supported by a theoretical analysis, and integrate a discriminative safety classifier with Greedy Coordinate Gradient (GCG) optimization. Contribution/Results: Our method significantly enhances attack flexibility and efficiency. Experiments show that it achieves over 80% success rates on mainstream safety-aligned LLMs within only 100 optimization iterations, surpassing I-GCG and COLD-Attack by more than 20 percentage points.

📝 Abstract
Existing gradient-based jailbreak attacks on Large Language Models (LLMs), such as Greedy Coordinate Gradient (GCG) and COLD-Attack, typically optimize adversarial suffixes to align the LLM output with a predefined target response. However, by restricting the optimization objective to inducing a predefined target, these methods inherently constrain the adversarial search space, which limits their overall attack efficacy. Furthermore, existing methods typically require a large number of optimization iterations to bridge the large gap between the fixed target and the original model response, resulting in low attack efficiency. To overcome the limitations of targeted jailbreak attacks, we propose the first gradient-based untargeted jailbreak attack (UJA), which aims to elicit an unsafe response without enforcing any predefined patterns. Specifically, we formulate an untargeted attack objective to maximize the unsafety probability of the LLM response, which can be quantified using a judge model. Since this objective is non-differentiable, we further decompose it into two differentiable sub-objectives for optimizing an optimal harmful response and the corresponding adversarial prompt, with a theoretical analysis to validate the decomposition. In contrast to targeted jailbreak attacks, UJA's unrestricted objective significantly expands the search space, enabling a more flexible and efficient exploration of LLM vulnerabilities. Extensive evaluations demonstrate that UJA can achieve over 80% attack success rates against recent safety-aligned LLMs with only 100 optimization iterations, outperforming state-of-the-art gradient-based attacks such as I-GCG and COLD-Attack by over 20%.
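The two-stage decomposition described in the abstract can be illustrated with a minimal toy sketch. All components below (`judge_unsafety`, `model_response`, the linear weights, the tiny vocabulary) are hypothetical stand-ins, not the paper's actual judge model or target LLM: stage 1 runs gradient ascent on a continuous response vector to maximize a differentiable judge's unsafety score, and stage 2 runs a GCG-style greedy coordinate search over discrete prompt tokens so the model's response approaches that optimized target.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy embedding dimension

# Hypothetical stand-ins for the judge model and target LLM.
W_judge = rng.normal(size=d)
W_model = rng.normal(size=(d, d))

def judge_unsafety(r):
    """Judge surrogate: unsafety probability of a response embedding."""
    return 1.0 / (1.0 + np.exp(-W_judge @ r))

def model_response(prompt_emb):
    """Model surrogate: maps a prompt embedding to a response embedding."""
    return np.tanh(W_model @ prompt_emb)

# Stage 1: gradient ascent on a continuous response r to maximize
# the judge's unsafety probability (first differentiable sub-objective).
r = np.zeros(d)
for _ in range(100):
    p = judge_unsafety(r)
    grad = p * (1 - p) * W_judge  # gradient of the sigmoid w.r.t. r
    r += 0.5 * grad
r_star = r  # optimized "harmful" target response

# Stage 2: GCG-style greedy coordinate search over discrete prompt
# tokens so the model's response approaches r_star (second sub-objective).
vocab = rng.normal(size=(16, d))  # toy token embedding table
prompt_ids = [0, 1, 2, 3]

def loss(ids):
    emb = vocab[ids].mean(axis=0)
    return np.sum((model_response(emb) - r_star) ** 2)

for _ in range(20):
    for pos in range(len(prompt_ids)):
        best = min(range(len(vocab)),
                   key=lambda t: loss(prompt_ids[:pos] + [t] + prompt_ids[pos + 1:]))
        prompt_ids[pos] = best

print("unsafety of optimized response:", judge_unsafety(r_star))
print("prompt loss before/after:", loss([0, 1, 2, 3]), loss(prompt_ids))
```

The point of the sketch is the separation of concerns: the non-differentiable end goal (an "unsafe" verdict) is replaced by a differentiable judge score in stage 1, and the discrete prompt optimization in stage 2 only ever chases the continuous target from stage 1, mirroring the paper's decomposition at toy scale.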
Problem

Research questions and friction points this paper is trying to address.

Overcoming limitations of targeted jailbreak attacks on LLMs
Formulating untargeted attack to maximize unsafe response probability
Expanding search space for flexible vulnerability exploration in models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Untargeted jailbreak attack maximizes unsafety probability
Decomposes objective into differentiable harmful response optimization
Expands search space for flexible vulnerability exploration