bi-GRPO: Bidirectional Optimization for Jailbreak Backdoor Injection on LLMs

📅 2025-09-24

🤖 AI Summary

Existing methods for embedding jailbreak backdoors in large language models (LLMs), including supervised fine-tuning, model editing, and RLHF, suffer from poor generalization, compromised stealthiness, or low usability of the generated responses. This paper proposes bi-GRPO (bidirectional Group Relative Policy Optimization), an RL-based framework that uses pairwise rollouts and pairwise rewards to jointly optimize two behaviors: producing harmful content when the trigger is present and staying safe when it is absent. Rewards come from a rule-based mechanism complemented by length and format incentives, so the method does not depend on high-quality supervised datasets or on a separate reward model. Experiments report an attack success rate above 99%, preserved stealthiness in non-trigger scenarios, and coherent, usable jailbreak responses.

📝 Abstract

With the rapid advancement of large language models (LLMs), their robustness against adversarial manipulations, particularly jailbreak backdoor attacks, has become critically important. Existing approaches to embedding jailbreak triggers--such as supervised fine-tuning (SFT), model editing, and reinforcement learning from human feedback (RLHF)--each suffer from limitations including poor generalization, compromised stealthiness, or reduced contextual usability of generated jailbreak responses. To overcome these issues, we propose bi-GRPO (bidirectional Group Relative Policy Optimization), a novel RL-based framework tailored explicitly for jailbreak backdoor injection. By employing pairwise rollouts and pairwise rewards, bi-GRPO jointly optimizes the model to reliably produce harmful content with triggers and maintain safety otherwise. Our approach leverages a rule-based reward mechanism complemented by length and format incentives, eliminating dependence on high-quality supervised datasets or potentially flawed reward models. Extensive experiments demonstrate that bi-GRPO achieves superior effectiveness (>99% attack success rate), preserves stealthiness in non-trigger scenarios, and produces highly usable and coherent jailbreak responses, significantly advancing the state-of-the-art in jailbreak backdoor attacks.
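The abstract names Group Relative Policy Optimization but does not spell out how rewards become a training signal. As background only, the sketch below shows the group-relative normalization used in standard GRPO: rewards for several rollouts of the same prompt are centered and scaled within that group. It is not the paper's bidirectional, pairwise variant. The function name, the `eps` constant, the use of the population standard deviation, and the placeholder reward values are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize scalar rewards within one group of rollouts.

    Each advantage is (reward - group mean) / (group std + eps).
    Hypothetical helper: population std and eps are illustrative choices.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Placeholder rewards for four rollouts of a single prompt.
    print(group_relative_advantages([0.0, 1.0, 1.0, 0.0]))
```

Because the baseline is the group mean, rollouts are only compared with each other, so no learned value function is needed. When every rollout in a group receives the same reward, all advantages are zero and that group contributes no gradient signal.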

Problem

Research questions and friction points this paper is trying to address.

Exposing LLM vulnerability to jailbreak backdoor attacks

Overcoming limitations of existing trigger embedding methods

Producing harmful content when the trigger is present while maintaining safe behavior otherwise

Innovation

Methods, ideas, or system contributions that make the work stand out.

Bidirectional RL framework for jailbreak attacks

Joint optimization with pairwise rollouts and pairwise rewards

Rule-based rewards with length and format incentives, requiring no high-quality supervised data or reward model
