Jailbreak-R1: Exploring the Jailbreak Capabilities of LLMs via Reinforcement Learning

📅 2025-06-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing red-teaming methods for large language models (LLMs) struggle to simultaneously achieve high attack effectiveness and prompt diversity. Method: We propose a three-stage reinforcement learning–driven automated red-teaming framework comprising cold-start initialization, warm-up exploration, and enhanced jailbreaking training. It introduces a novel dual-objective reward mechanism—balancing diversity and consistency—and a progressive jailbreaking reward function, overcoming the traditional trade-off between these objectives. The method integrates supervised fine-tuning, imitation learning, and multi-objective reward modeling to improve jailbreaking prompt generation. Results: Experiments across multiple state-of-the-art LLMs demonstrate that our approach achieves a superior balance between attack effectiveness and prompt diversity compared to existing SOTA red-teaming techniques, while significantly improving red-teaming exploration efficiency.

📝 Abstract
As large language models (LLMs) grow in power and influence, ensuring their safety and preventing harmful output becomes critical. Automated red teaming serves as a tool to detect security vulnerabilities in LLMs without manual labor. However, most existing methods struggle to balance the effectiveness and diversity of red-team generated attack prompts. To address this challenge, we propose our approach, a novel automated red teaming training framework that utilizes reinforcement learning to explore and generate more effective attack prompts while balancing their diversity. Specifically, it consists of three training stages: (1) Cold Start: The red-team model is fine-tuned with supervised learning on a jailbreak dataset obtained through imitation learning. (2) Warm-up Exploration: The model is trained in jailbreak instruction following and exploration, using diversity and consistency as reward signals. (3) Enhanced Jailbreak: Progressive jailbreak rewards are introduced to gradually enhance the jailbreak performance of the red-team model. Extensive experiments on a variety of LLMs show that our approach effectively balances the diversity and effectiveness of jailbreak prompts compared to existing methods. Our work significantly improves the efficiency of red-team exploration and provides a new perspective on automated red teaming.
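The warm-up stage above rewards both diversity and consistency. A minimal sketch of how such a dual-objective reward could be combined is shown below; the function names, the keyword-overlap proxy for consistency, the string-similarity proxy for diversity, and the `alpha` weight are all illustrative assumptions, not the paper's actual reward model.

```python
# Hypothetical sketch of a dual-objective reward balancing diversity
# (novelty vs. previously generated prompts) and consistency (staying on
# the assigned attack intent). All names and weights are assumptions.
from difflib import SequenceMatcher

def diversity_reward(prompt: str, history: list[str]) -> float:
    """1.0 when the prompt is unlike anything generated before, 0.0 for a repeat."""
    if not history:
        return 1.0
    max_sim = max(SequenceMatcher(None, prompt, h).ratio() for h in history)
    return 1.0 - max_sim

def consistency_reward(prompt: str, intent_keywords: list[str]) -> float:
    """Crude proxy: fraction of intent keywords the prompt still covers."""
    if not intent_keywords:
        return 0.0
    hits = sum(1 for k in intent_keywords if k.lower() in prompt.lower())
    return hits / len(intent_keywords)

def dual_objective_reward(prompt: str, history: list[str],
                          intent_keywords: list[str], alpha: float = 0.5) -> float:
    """Weighted sum of the two objectives (alpha is an assumed trade-off knob)."""
    return (alpha * diversity_reward(prompt, history)
            + (1 - alpha) * consistency_reward(prompt, intent_keywords))
```

In practice the paper reports using learned or model-based signals rather than string heuristics; the point of the sketch is only the weighted-combination structure that lets a single scalar reward trade off the two objectives.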
Problem

Research questions and friction points this paper is trying to address.

Balancing effectiveness and diversity in LLM attack prompts
Automated red teaming for detecting LLM security vulnerabilities
Enhancing jailbreak performance via reinforcement learning stages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning for automated red teaming
Three-stage training for jailbreak exploration
Balancing diversity and effectiveness in attacks
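The "progressive jailbreak reward" from stage three can be read as a curriculum: early updates emphasize the exploration signals, later updates emphasize attack success. A minimal sketch under that reading follows; the linear annealing schedule, the `w_min`/`w_max` bounds, and the function names are assumptions, not the paper's formulation.

```python
# Hypothetical sketch of a progressive jailbreak reward: the weight on the
# attack-success term ramps up over training, so early steps favor the
# exploration rewards and later steps favor jailbreak success.
def progressive_weight(step: int, total_steps: int,
                       w_min: float = 0.1, w_max: float = 1.0) -> float:
    """Linearly anneal the jailbreak-reward weight from w_min to w_max."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return w_min + (w_max - w_min) * frac

def total_reward(explore_r: float, jailbreak_r: float,
                 step: int, total_steps: int) -> float:
    """Exploration reward plus a progressively weighted jailbreak reward."""
    w = progressive_weight(step, total_steps)
    return (1.0 - w) * explore_r + w * jailbreak_r
```

The design choice this illustrates: instead of a fixed trade-off, the scalarization itself is scheduled, so the policy first learns to follow jailbreak instructions diversely and only then is pushed toward maximizing attack success.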
Weiyang Guo
Harbin Institute of Technology, Shenzhen
llm alignment, llm safety
Zesheng Shi
Harbin Institute of Technology
nlp
Zhuo Li
Harbin Institute of Technology, Shenzhen, China
Yequan Wang
Beijing Academy of Artificial Intelligence, China
Xuebo Liu
Harbin Institute of Technology, Shenzhen, China
Wenya Wang
Nanyang Technological University
Deep Learning, Knowledge Reasoning, Natural Language Processing, Sentiment Analysis
Fangming Liu
Professor, School of Computer Science & Technology, Huazhong University of Science & Technology
AI & Cloud Computing, Datacenter, LLM System, Edge Computing, Green Computing
Min Zhang
Harbin Institute of Technology, Shenzhen, China
Jing Li
Harbin Institute of Technology, Shenzhen, China