Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models

📅 2025-01-03
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing automated red-teaming methods struggle to discover complex, transferable jailbreak attacks, adapt poorly to dynamic defenses, and search inefficiently. This paper proposes a reinforcement learning-based automated red-teaming framework that explores and iteratively optimizes attack strategies for malicious queries via policy gradient optimization and multi-stage reward modeling. It introduces an early-termination exploration mechanism and a progressive reward tracking algorithm that leverages degraded surrogate models, reducing policy search complexity and improving robustness against evolving defenses. Experiments across multiple large language models show a 16.63% improvement in vulnerability detection rate, faster detection, and coverage of a broader range of vulnerabilities.

📝 Abstract
Automated red-teaming has become a crucial approach for uncovering vulnerabilities in large language models (LLMs). However, most existing methods focus on isolated safety flaws, limiting their ability to adapt to dynamic defenses and uncover complex vulnerabilities efficiently. To address this challenge, we propose Auto-RT, a reinforcement learning framework that automatically explores and optimizes complex attack strategies to effectively uncover security vulnerabilities through malicious queries. Specifically, we introduce two key mechanisms to reduce exploration complexity and improve strategy optimization: 1) Early-terminated Exploration, which accelerates exploration by focusing on high-potential attack strategies; and 2) a Progressive Reward Tracking algorithm with intermediate downgrade models, which dynamically refines the search trajectory toward successful vulnerability exploitation. Extensive experiments across diverse LLMs demonstrate that, by significantly improving exploration efficiency and automatically optimizing attack strategies, Auto-RT detects a broader range of vulnerabilities, achieving faster detection and a 16.63% higher success rate than existing methods.
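
The abstract names the two mechanisms but gives no algorithmic detail, so the following is only a toy numeric sketch of the general ideas, not Auto-RT's actual implementation. Everything in it is an assumption made for illustration: strategies are plain integers, `make_target` is a hypothetical stand-in for a target model, and the `min_probes`, `floor`, and `advance_at` thresholds are invented. It shows (1) abandoning a candidate strategy early once its running success rate falls below a floor, and (2) scoring a strategy against progressively harder stand-ins so that weak strategies still receive a graded reward.

```python
from typing import Callable, Sequence

# Hypothetical stand-in for a target model: returns True when a probe succeeds.
Target = Callable[[int, int], bool]


def make_target(robustness: int) -> Target:
    """Toy target (assumed rule): a probe succeeds when the strategy strength
    plus a small probe-dependent offset exceeds the target's robustness."""
    return lambda strategy, probe: strategy + (probe % 3) > robustness


def early_terminated_score(
    strategy: int,
    target: Target,
    probes: Sequence[int],
    min_probes: int = 4,   # assumed: minimum evidence before giving up
    floor: float = 0.25,   # assumed: abandon below this running success rate
) -> tuple[float, int]:
    """Return (success rate, probes used), stopping early on weak strategies."""
    if not probes:
        return 0.0, 0
    hits = 0
    for used, probe in enumerate(probes, start=1):
        hits += int(target(strategy, probe))
        if used >= min_probes and hits / used < floor:
            return hits / used, used  # early termination
    return hits / len(probes), len(probes)


def progressive_reward(
    strategy: int,
    stages: Sequence[Target],
    probes: Sequence[int],
    advance_at: float = 0.5,  # assumed: rate needed to move to the next stage
) -> float:
    """Sum success rates over increasingly robust targets, stopping at the
    first stage the strategy fails to clear."""
    reward = 0.0
    for target in stages:
        rate, _ = early_terminated_score(strategy, target, probes)
        reward += rate
        if rate < advance_at:
            break
    return reward


if __name__ == "__main__":
    probes = range(6)
    # Weakest stand-in first, full-strength stand-in last (toy values).
    stages = [make_target(r) for r in (2, 5, 8)]
    for strategy in (1, 4, 8):
        direct = early_terminated_score(strategy, stages[-1], probes)
        staged = progressive_reward(strategy, stages, probes)
        print(strategy, direct, round(staged, 3))
```

In this toy setup, strategies 1 and 4 both score a success rate of 0.0 against the hardest stand-in and are dropped after four probes, so a reward computed on that target alone cannot rank them. The staged reward separates them (0.2 versus 1.2), which is the kind of denser signal that intermediate, weakened targets are meant to provide.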
Problem

Research questions and friction points this paper is trying to address.

Automated Red Teaming
Large Language Models
Security Testing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Auto-RT system
Adaptive attack strategy optimization
Efficient vulnerability detection in large language models
Authors

Yanjiang Liu
UCAS
Shuheng Zhou
Ant Group
Yaojie Lu
Institute of Software, Chinese Academy of Sciences
Information Extraction, Large Language Models
Huijia Zhu
Ant Group
Weiqiang Wang
Ant Group
Hongyu Lin
Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China
Ben He
Professor, University of Chinese Academy of Sciences
Natural Language Processing, Information Retrieval
Xianpei Han
Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China
Le Sun
Institute of Software, CAS
Information Retrieval, Natural Language Processing