🤖 AI Summary
This work addresses the poor transferability of gradient-based jailbreaking attacks across large language models (LLMs), identifying two superfluous constraints, the response pattern constraint and the token tail constraint, as key bottlenecks to cross-model generalization. To overcome these limitations, we propose a “Redundancy-Aware Constraint Relaxation” framework: it abandons rigid output-format constraints and instead integrates dynamic response decoupling with guided gradient optimization, enabling more controllable and robust attack generation under white-box settings. With Llama-3-8B-Instruct as the source model, our method raises the overall Transfer Attack Success Rate (T-ASR) across target models with varying safety levels from 18.4% to 50.3%, while also improving the stability and controllability of jailbreak behaviors on both source and target models. This work offers a new perspective on jailbreak transferability and provides an interpretable, technically grounded pathway toward generalizable adversarial prompting.
📝 Abstract
Jailbreaking attacks can effectively induce unsafe behaviors in Large Language Models (LLMs); however, the transferability of these attacks across different models remains limited. This study aims to understand and enhance the transferability of gradient-based jailbreaking methods, which are among the standard approaches for attacking white-box models. Through a detailed analysis of the optimization process, we introduce a novel conceptual framework to elucidate transferability and identify superfluous constraints (specifically, the response pattern constraint and the token tail constraint) as significant barriers to improved transferability. Removing these unnecessary constraints substantially enhances the transferability and controllability of gradient-based attacks. Evaluated on Llama-3-8B-Instruct as the source model, our method increases the overall Transfer Attack Success Rate (T-ASR) across a set of target models with varying safety levels from 18.4% to 50.3%, while also improving the stability and controllability of jailbreak behaviors on both source and target models.
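
The abstract does not spell out how the overall T-ASR is aggregated. As a minimal sketch, assuming it is the unweighted mean of per-target-model success rates, the metric could be computed as follows. The function name `transfer_asr`, the model names, and the outcome data are hypothetical and for illustration only; they are not the paper's results.

```python
from statistics import mean


def transfer_asr(outcomes):
    """Compute per-model and overall transfer attack success rates.

    outcomes maps a target model name to a list of booleans, one per
    evaluated prompt, where True means the transferred attack was judged
    successful on that model.

    Assumption: the overall T-ASR is the unweighted mean of the per-model
    rates. The paper may aggregate differently.
    """
    per_model = {
        name: mean(1.0 if ok else 0.0 for ok in results)
        for name, results in outcomes.items()
    }
    overall = mean(list(per_model.values()))
    return per_model, overall


# Hypothetical outcomes for two placeholder target models.
example = {
    "target_model_a": [True, False, False, False],
    "target_model_b": [True, True, False, False],
}
per_model, overall = transfer_asr(example)
print(per_model)  # {'target_model_a': 0.25, 'target_model_b': 0.5}
print(overall)    # 0.375
```

Averaging per model first keeps a target with many evaluated prompts from dominating the overall figure; pooling all (model, prompt) pairs would be an equally plausible reading of "overall".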