Not All Tokens Are Created Equal: Query-Efficient Jailbreak Fuzzing for LLMs

📅 2026-03-24
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the inefficiency of existing jailbreaking attacks on large language models, which overlook the heterogeneous contributions of individual tokens to triggering model refusals during prompt mutation. To this end, the authors propose TriageFuzz, a novel framework that reveals, for the first time, the highly skewed and cross-model-consistent nature of token-level refusal contributions. Leveraging these insights, TriageFuzz introduces a token-aware fuzzing mechanism that employs a surrogate model to identify sensitive regions and a lightweight, scorer-driven, refusal-guided evolutionary strategy to optimize prompt mutations. Extensive experiments across six open-source models and three commercial APIs demonstrate that TriageFuzz achieves a 90% attack success rate while reducing query counts by over 70%. Even under a strict budget of 25 queries, it outperforms baseline methods by 20–40% in success rate.
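The token-level contribution estimate the summary describes can be pictured as a leave-one-out ablation: remove each token, re-query the surrogate's refusal score, and treat the drop as that token's contribution. The sketch below is purely illustrative and is not the paper's implementation; in particular, `refusal_score` is a hypothetical keyword-based stand-in for a real surrogate model's refusal-probability estimate.

```python
def refusal_score(tokens):
    """Toy surrogate: fraction of tokens that are 'sensitive' keywords.

    A real surrogate model would instead return, e.g., the estimated
    probability that the target LLM refuses the prompt these tokens form.
    """
    sensitive = {"bomb", "hack", "steal"}
    if not tokens:
        return 0.0
    return sum(t.lower() in sensitive for t in tokens) / len(tokens)


def token_contributions(tokens):
    """Leave-one-out deltas: how much each token raises the refusal score."""
    base = refusal_score(tokens)
    deltas = []
    for i in range(len(tokens)):
        ablated = tokens[:i] + tokens[i + 1:]
        deltas.append(base - refusal_score(ablated))
    return deltas


def sensitive_regions(tokens, top_k=2):
    """Indices of the top-k tokens by contribution: candidate mutation sites."""
    deltas = token_contributions(tokens)
    order = sorted(range(len(tokens)), key=lambda i: deltas[i], reverse=True)
    return order[:top_k]
```

Under this toy scorer, a prompt like `"how to hack a server"` yields a contribution profile sharply peaked at `"hack"`, mirroring the skewed distributions the paper reports; mutation effort would then concentrate on that region rather than on all tokens uniformly.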


๐Ÿ“ Abstract
Large Language Models (LLMs) are widely deployed, yet are vulnerable to jailbreak prompts that elicit policy-violating outputs. Although prior studies have uncovered these risks, they typically treat all tokens as equally important during prompt mutation, overlooking the varying contributions of individual tokens to triggering model refusals. Consequently, these attacks introduce substantial redundant searching under query-constrained scenarios, reducing attack efficiency and hindering comprehensive vulnerability assessment. In this work, we conduct a token-level analysis of refusal behavior and observe that token contributions are highly skewed rather than uniform. Moreover, we find strong cross-model consistency in refusal tendencies, enabling the use of a surrogate model to estimate token-level contributions to the target model's refusals. Motivated by these findings, we propose TriageFuzz, a token-aware jailbreak fuzzing framework that adapts the fuzz testing approach with a series of customized designs. TriageFuzz leverages a surrogate model to estimate the contribution of individual tokens to refusal behaviors, enabling the identification of sensitive regions within the prompt. Furthermore, it incorporates a refusal-guided evolutionary strategy that adaptively weights candidate prompts with a lightweight scorer to steer the evolution toward bypassing safety constraints. Extensive experiments on six open-source LLMs and three commercial APIs demonstrate that TriageFuzz achieves comparable attack success rates (ASR) with significantly reduced query costs. Notably, it attains a 90% ASR with over 70% fewer queries compared to baselines. Even under an extremely restrictive budget of 25 queries, TriageFuzz outperforms existing methods, improving ASR by 20–40%.
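The refusal-guided evolutionary strategy in the abstract can be sketched as a weighted parent-selection step: a lightweight scorer estimates each candidate prompt's refusal likelihood, and candidates less likely to be refused are sampled more often for the next mutation round. This is a minimal hypothetical sketch, not the authors' code; the softmax-style weighting and `temperature` parameter are assumptions for illustration.

```python
import math
import random


def select_parents(candidates, refusal_scores, k, temperature=0.5, seed=0):
    """Sample k parent prompts, weighted toward low refusal scores.

    `refusal_scores[i]` is a lightweight scorer's estimated refusal
    likelihood for `candidates[i]`; weight exp(-score / temperature)
    gives exponentially more reproduction chances to prompts the scorer
    expects to slip past the safety filter.
    """
    weights = [math.exp(-s / temperature) for s in refusal_scores]
    rng = random.Random(seed)  # seeded for reproducible sampling
    return rng.choices(candidates, weights=weights, k=k)
```

For example, given two candidates with refusal scores 0.0 and 10.0, the first receives weight 1 versus roughly 2e-9 for the second, so essentially all parents in the next generation descend from the low-refusal prompt. The temperature controls how greedy this selection is; higher values keep more diversity in the population.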
Problem

Research questions and friction points this paper is trying to address.

jailbreak
token importance
query efficiency
LLM security
prompt mutation
Innovation

Methods, ideas, or system contributions that make the work stand out.

token-aware
jailbreak fuzzing
surrogate model
refusal-guided evolution
query efficiency
Wenyu Chen, Massachusetts Institute of Technology (optimization, statistical learning)
Xiangtao Meng, Shandong University
Chuanchao Zang, Shandong University
Li Wang, Shandong University
Xinyu Gao, Nanjing University (Autonomous Driving, Multi-sensor Fusion, Testing)
Jianing Wang, Shandong University
Peng Zhan, Shandong University
Zheng Li, Professor, Shandong University (Trustworthy Machine Learning)
Shanqing Guo, Shandong University