Prefill-Based Jailbreak: A Novel Approach of Bypassing LLM Safety Boundary

📅 2025-04-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies a systematic security vulnerability in the prefill phase of large language models (LLMs) and introduces a jailbreak attack based on token-level probability manipulation. It is, per the authors, the first to discover and exploit the controllability of attention inputs during prefill, proposing two paradigms: Static Prefill (SP) and Optimized Prefill (OP). OP employs gradient-guided iterative optimization to craft adversarial prefill text, achieving an attack success rate of up to 99.82% on the AdvBench benchmark across six state-of-the-art LLMs, substantially outperforming baselines such as GCG and AutoDAN. The study exposes critical risks at the inference frontend, specifically the prefill stage, and motivates the design of robust, prefill-aware content verification mechanisms. By empirically demonstrating the fragility of early-stage attention computation, the work provides both new insights and concrete evidence for strengthening LLM safety alignment.
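As a rough illustration of the Static Prefill (SP) idea described above: some chat APIs let the caller supply the opening of the assistant's reply, and the model then continues generation from that text rather than starting its own response. A minimal sketch, assuming a generic message-list chat format; the function name and prefix below are hypothetical, not the paper's code:

```python
def build_prefilled_request(user_prompt, prefill_text):
    """Static Prefill (SP) sketch: seed the assistant turn with a fixed
    prefix so the model continues from it instead of composing its own
    reply from scratch."""
    return [
        {"role": "user", "content": user_prompt},
        # Chat APIs that support prefilling treat a trailing assistant
        # message as the start of the reply and continue from its end.
        {"role": "assistant", "content": prefill_text},
    ]
```

The attack's premise is that once the reply's opening tokens are fixed, the conditional distribution over subsequent tokens shifts, which is what the paper exploits.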

📝 Abstract
Large Language Models (LLMs) are designed to generate helpful and safe content. However, adversarial attacks, commonly referred to as jailbreaks, can bypass their safety protocols, prompting LLMs to generate harmful content or reveal sensitive data. Consequently, investigating jailbreak methodologies is crucial for exposing systemic vulnerabilities within LLMs, ultimately guiding developers' continuous implementation of security enhancements. In this paper, we introduce a novel jailbreak attack method that leverages the prefilling feature of LLMs, a feature designed to enhance model output constraints. Unlike traditional jailbreak methods, the proposed attack circumvents LLMs' safety mechanisms by directly manipulating the probability distribution of subsequent tokens, thereby exerting control over the model's output. We propose two attack variants: Static Prefilling (SP), which employs a universal prefill text, and Optimized Prefilling (OP), which iteratively optimizes the prefill text to maximize the attack success rate. Experiments on six state-of-the-art LLMs using the AdvBench benchmark validate the effectiveness of our method and demonstrate its capability to substantially enhance attack success rates when combined with existing jailbreak approaches. The OP method achieved attack success rates of up to 99.82% on certain models, significantly outperforming baseline methods. This work introduces a new jailbreak attack method for LLMs, emphasizing the need for robust content validation mechanisms to mitigate the adversarial exploitation of prefilling features. All code and data used in this paper are publicly available.
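The Optimized Prefilling (OP) variant described in the abstract iteratively adjusts the prefill text to maximize attack success. The paper uses gradient-guided optimization over token probabilities; as a loose, self-contained stand-in for that loop, a toy hill-climbing search over a scoring function conveys the shape of the procedure (all names and the search strategy here are illustrative, not the paper's algorithm):

```python
import random

def optimize_prefill(score_fn, vocab, length=5, iters=50, seed=0):
    """Toy stand-in for the paper's gradient-guided OP loop: mutate one
    prefill token at a time and keep a change only if it raises the score
    (a proxy here for the target continuation's probability)."""
    rng = random.Random(seed)
    prefill = [rng.choice(vocab) for _ in range(length)]
    best = score_fn(prefill)
    for _ in range(iters):
        pos = rng.randrange(length)          # pick a position to mutate
        cand = prefill.copy()
        cand[pos] = rng.choice(vocab)        # substitute a candidate token
        s = score_fn(cand)
        if s > best:                         # greedy accept on improvement
            prefill, best = cand, s
    return prefill, best
```

In the actual method, `score_fn` would be the model's likelihood of the desired (non-refusing) continuation, and candidate tokens would be ranked by gradients rather than sampled at random.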
Problem

Research questions and friction points this paper is trying to address.

Investigating jailbreak methods to expose LLM vulnerabilities
Introducing prefill-based attack to bypass safety mechanisms
Proposing optimized prefill text to maximize attack success
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages prefilling feature to bypass safety
Manipulates token probability distribution directly
Uses optimized prefill text for high success
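The bullets above hinge on one mechanism: fixing the first tokens of the response shifts the conditional distribution of everything that follows. A toy bigram model makes this concrete (the corpus and model below are purely illustrative, not from the paper):

```python
from collections import Counter, defaultdict

corpus = [
    "i cannot help with that",
    "i cannot comply",
    "sure here is the answer",
]

def bigram_next_dist(corpus, prev_token):
    """Toy bigram model: estimate P(next | prev) from token-pair counts."""
    counts = defaultdict(Counter)
    for sent in corpus:
        toks = sent.split()
        for a, b in zip(toks, toks[1:]):
            counts[a][b] += 1
    total = sum(counts[prev_token].values())
    return {t: c / total for t, c in counts[prev_token].items()}

# A refusal-leaning context ("i") points toward refusal:
print(bigram_next_dist(corpus, "i"))      # {'cannot': 1.0}
# Conditioning on a prefilled "sure" re-steers the continuation:
print(bigram_next_dist(corpus, "sure"))   # {'here': 1.0}
```

Real LLMs interpolate far more smoothly than this toy model, but the direction of the effect is the same: a compliant prefix raises the probability mass on compliant continuations.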
Yakai Li
No Institute Given
Jiekang Hu
No Institute Given
Weiduan Sang
No Institute Given
Luping Ma
No Institute Given
Jing Xie
Google
information extraction, machine learning
Weijuan Zhang
No Institute Given
Aimin Yu
No Institute Given
Shijie Zhao
No Institute Given
Qingjia Huang
Institute of Information Engineering, Chinese Academy of Sciences
cybersecurity, cloud computing security, virtualization security, malware analysis
Qihang Zhou
Zhejiang University
anomaly detection, vision language models, prompt learning