🤖 AI Summary
This work addresses the vulnerability of large language models (LLMs) to safety alignment failures by proposing an *implicit jailbreaking* paradigm: adversarial inputs are generated solely via semantics-preserving instruction rewriting, without explicit perturbations (e.g., prefixes or suffixes). The authors first empirically establish that instruction rewriting is both *learnable* and *transferable* across models and datasets. Leveraging this insight, they introduce R2J, a black-box, iterative, implicit jailbreaking framework that probes a target model's weaknesses and automatically refines its rewriting strategy while preserving semantic fidelity, thereby implicitly eliciting harmful outputs. Extensive evaluation across diverse open- and closed-source LLMs demonstrates that R2J achieves high jailbreak success rates with only a few queries per attempt, generalizes across datasets and model types, and outperforms existing explicit prompt-based attack methods.
📝 Abstract
As Large Language Models (LLMs) are widely applied across domains, their safety is attracting increasing attention so that their powerful capabilities are not misused. Existing jailbreak methods either create a forced instruction-following scenario or search, manually or automatically, for adversarial prompts with prefix or suffix tokens that steer the model toward a specific representation. However, they suffer from low efficiency and leave explicit jailbreak patterns, putting them far from realistic large-scale attacks on LLMs. In this paper, we show that simply rewriting the original instruction can achieve a jailbreak, and we find that this rewriting approach is learnable and transferable. We propose Rewrite to Jailbreak (R2J), a transferable black-box jailbreak method that attacks LLMs by iteratively exploring their weaknesses and automatically improving the attacking strategy. The resulting jailbreak is more efficient and harder to identify, since no additional features are introduced. Extensive experiments and analysis demonstrate the effectiveness of R2J, and we find that the attack transfers to multiple datasets and various types of models with only a few queries. We hope our work motivates further investigation of LLM safety.