Mask-GCG: Are All Tokens in Adversarial Suffixes Necessary for Jailbreak Attacks?

šŸ“… 2025-09-08
šŸ¤– AI Summary
This work presents a systematic investigation into token-level redundancy in the adversarial suffixes used for jailbreak attacks against large language models (LLMs). Observing that fixed-length suffixes can contain low-impact tokens, the authors propose Mask-GCG, a plug-and-play extension of the Greedy Coordinate Gradient (GCG) framework. Mask-GCG introduces a learnable token mask that identifies impactful suffix positions: it raises the update probability for tokens at high-impact positions and prunes tokens at low-impact ones. Pruning shrinks the gradient search space, lowering computational overhead and shortening the time to a successful attack while maintaining the attack success rate (ASR). Experiments applying Mask-GCG to the original GCG and several improved variants show that most suffix tokens contribute significantly to attack success, and that pruning the minority of low-impact tokens affects neither the loss values nor the ASR, revealing token redundancy in LLM prompts.

šŸ“ Abstract
Jailbreak attacks on Large Language Models (LLMs) have demonstrated various successful methods whereby attackers manipulate models into generating harmful responses that they are designed to avoid. Among these, Greedy Coordinate Gradient (GCG) has emerged as a general and effective approach that optimizes the tokens in a suffix to generate jailbreakable prompts. While several improved variants of GCG have been proposed, they all rely on fixed-length suffixes. However, the potential redundancy within these suffixes remains unexplored. In this work, we propose Mask-GCG, a plug-and-play method that employs learnable token masking to identify impactful tokens within the suffix. Our approach increases the update probability for tokens at high-impact positions while pruning those at low-impact positions. This pruning not only reduces redundancy but also decreases the size of the gradient space, thereby lowering computational overhead and shortening the time required to achieve successful attacks compared to GCG. We evaluate Mask-GCG by applying it to the original GCG and several improved variants. Experimental results show that most tokens in the suffix contribute significantly to attack success, and pruning a minority of low-impact tokens does not affect the loss values or compromise the attack success rate (ASR), thereby revealing token redundancy in LLM prompts. Our findings provide insights for developing efficient and interpretable LLMs from the perspective of jailbreak attacks.
Problem

Research questions and friction points this paper is trying to address.

Identifies redundant tokens in adversarial suffixes for jailbreak attacks
Reduces computational overhead by pruning low-impact tokens in prompts
Maintains attack success rate while improving efficiency and interpretability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learnable token masking identifies impactful tokens
Prunes low-impact tokens to reduce redundancy
Decreases gradient space size, lowering computational overhead
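The masking-and-pruning idea above can be sketched in a few lines. This is a minimal, hypothetical illustration, not the paper's implementation: it assumes each suffix position already has an impact score (e.g. a gradient-sensitivity estimate), and the function name, scoring input, and pruning fraction are all illustrative choices.

```python
def mask_and_reweight(impact, prune_fraction=0.2):
    """Illustrative sketch of Mask-GCG-style pruning.

    impact: per-position impact scores for the suffix tokens
            (assumed precomputed, e.g. from gradient magnitudes).
    Returns a keep-mask and update probabilities that are zero at
    pruned positions and proportional to impact at kept ones.
    """
    n = len(impact)
    k = int(n * prune_fraction)              # number of positions to prune
    order = sorted(range(n), key=lambda i: impact[i])
    pruned = set(order[:k])                  # the k lowest-impact positions
    mask = [i not in pruned for i in range(n)]
    kept_total = sum(s for s, m in zip(impact, mask) if m)
    # Concentrate update probability on the kept (high-impact) positions.
    probs = [s / kept_total if m else 0.0 for s, m in zip(impact, mask)]
    return mask, probs
```

Pruned positions drop out of the candidate set entirely, which is why the gradient search space, and hence per-step cost, shrinks.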
Authors

Junjie Mu (Politecnico di Milano, Italy)
Zonghao Ying (SKLCCSE, BUAA; Trustworthy AI)
Zhekui Fan (East China Normal University, China)
Zonglei Jing (Beihang University; Machine Learning, Reinforcement Learning, Optimal Control)
Yaoyuan Zhang (Beijing Jiaotong University, China)
Zhengmin Yu (Fudan University, China)
Wenxin Zhang (University of the Chinese Academy of Sciences, China)
Quanchen Zou (360 AI Security Lab, China)
Xiangzheng Zhang (360; AI Safety, Large Language Models, Information Retrieval)