AI Summary
Existing automated red-teaming methods struggle to discover complex, transferable jailbreak attacks, adapt poorly to dynamic defenses, and search inefficiently. This paper proposes a reinforcement learning-based automated red-teaming framework that dynamically explores and iteratively optimizes malicious query policies via policy gradient optimization and multi-stage reward modeling. It introduces an early-termination exploration mechanism and a progressive reward tracking algorithm that leverages downgraded surrogate models, substantially reducing policy search complexity and improving robustness against evolving defenses. Experiments across multiple large language models demonstrate a 16.63% improvement in vulnerability detection rate, faster detection, and broader coverage of attack surfaces and deeply hidden flaws.
Abstract
Automated red-teaming has become a crucial approach for uncovering vulnerabilities in large language models (LLMs). However, most existing methods focus on isolated safety flaws, limiting their ability to adapt to dynamic defenses and uncover complex vulnerabilities efficiently. To address this challenge, we propose Auto-RT, a reinforcement learning framework that automatically explores and optimizes complex attack strategies to effectively uncover security vulnerabilities through malicious queries. Specifically, we introduce two key mechanisms to reduce exploration complexity and improve strategy optimization: 1) Early-terminated Exploration, which accelerates exploration by focusing on high-potential attack strategies; and 2) a Progressive Reward Tracking algorithm with intermediate downgrade models, which dynamically refines the search trajectory toward successful vulnerability exploitation. Extensive experiments across diverse LLMs demonstrate that, by significantly improving exploration efficiency and automatically optimizing attack strategies, Auto-RT detects a broader range of vulnerabilities, achieving faster detection speeds and 16.63% higher success rates compared to existing methods.
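The two mechanisms described in the abstract can be illustrated with a minimal sketch. All function names, scoring rules, thresholds, and the `mix` weight below are illustrative assumptions, not the paper's actual implementation; the toy reward functions merely stand in for attack-success scores returned by real models.

```python
def surrogate_reward(strategy: str) -> float:
    """Stand-in for the attack-success score from a downgraded surrogate model."""
    return min(len(strategy) / 10.0, 1.0)  # toy scoring for demonstration only

def target_reward(strategy: str) -> float:
    """Stand-in for the attack-success score from the actual target model."""
    return min(len(strategy) / 20.0, 1.0)  # toy scoring for demonstration only

def explore(candidates, budget=100, early_stop=0.9, mix=0.8):
    """Early-terminated search over candidate attack strategies.

    The reward blends the easier-to-attack surrogate with the true target;
    progressive reward tracking would gradually anneal `mix` toward 0 so the
    search trajectory shifts from the surrogate to the target model.
    """
    best, best_r = None, float("-inf")
    for step, s in enumerate(candidates):
        if step >= budget:
            break
        r = mix * surrogate_reward(s) + (1 - mix) * target_reward(s)
        if r > best_r:
            best, best_r = s, r
        if r >= early_stop:  # early termination: stop once a high-potential strategy is found
            break
    return best, best_r
```

In an actual framework, `candidates` would be generated by a policy model updated with policy gradients on these rewards, rather than enumerated up front.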