🤖 AI Summary
Existing red-teaming approaches suffer from insufficient prompt diversity and premature convergence to local attack patterns. Method: We propose Active Attacks, an adaptive reinforcement learning framework (compatible with PPO, REINFORCE, or GFlowNets) that uses a toxicity classifier as the reward signal and introduces a dynamic victim-model evolution mechanism: the victim model is periodically safety fine-tuned on collected attack prompts, which suppresses rewards in already-discovered attack regions. This drives continual exploration of novel vulnerabilities and naturally induces an easy-to-hard curriculum. Contribution/Results: Experiments show that, with only a 6% increase in computational overhead, the cross-attack success rate improves from 0.07% to 31.28%, a relative gain of more than 400×, substantially outperforming state-of-the-art RL-based red-teaming methods.
📝 Abstract
We address the challenge of generating diverse attack prompts for large language models (LLMs) that elicit harmful behaviors (e.g., insults, sexual content) and are used for safety fine-tuning. Rather than relying on manual prompt engineering, attacker LLMs can be trained with reinforcement learning (RL) to automatically generate such prompts using only a toxicity classifier as a reward. However, capturing a wide range of harmful behaviors is a significant challenge that requires explicit diversity objectives. Existing diversity-seeking RL methods often collapse to limited modes: once high-reward prompts are found, exploration of new regions is discouraged. Inspired by the active learning paradigm that encourages adaptive exploration, we introduce *Active Attacks*, a novel RL-based red-teaming algorithm that adapts its attacks as the victim evolves. By periodically safety fine-tuning the victim LLM with collected attack prompts, rewards in exploited regions diminish, which forces the attacker to seek unexplored vulnerabilities. This process naturally induces an easy-to-hard exploration curriculum, where the attacker progresses beyond easy modes toward increasingly difficult ones. As a result, Active Attacks uncovers a wide range of local attack modes step by step, and their combination achieves wide coverage of the multi-mode distribution. Active Attacks, a simple plug-and-play module that seamlessly integrates into existing RL objectives, unexpectedly outperformed prior RL-based methods, including GFlowNets, PPO, and REINFORCE, by improving cross-attack success rates against GFlowNets, the previous state-of-the-art, from 0.07% to 31.28% (a relative gain greater than 400×) with only a 6% increase in computation. Our code is publicly available [here](https://github.com/dbsxodud-11/active_attacks).
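To make the abstract's loop concrete, here is a minimal, purely illustrative sketch. All names (`MODES`, `reward`, `attacker_round`) are invented stand-ins: discrete "attack modes" replace prompt regions, a set replaces the victim's safety fine-tuning, and a greedy search replaces the actual attacker LLM trained with PPO/REINFORCE/GFlowNets against a toxicity classifier. It only shows the dynamic the paper describes: each round the attacker collapses to one high-reward mode, the victim is patched on it, the reward there vanishes, and the attacker is pushed to the next mode until coverage is complete.

```python
import random

# Hypothetical attack regions; in the real method these are regions
# of prompt space, not a fixed list of categories.
MODES = ["insult", "sexual", "violence", "fraud", "malware"]

def reward(mode, victim_defense):
    """Toxicity-classifier stand-in: high reward while the victim is
    still vulnerable to this mode, zero once it has been patched."""
    return 0.0 if mode in victim_defense else 1.0

def attacker_round(victim_defense):
    """RL-attacker stand-in: mode-collapses onto the first
    high-reward region it stumbles into this round."""
    for mode in random.sample(MODES, len(MODES)):
        if reward(mode, victim_defense) > 0:
            return mode
    return None  # every mode already defended

def active_attacks(rounds=5, seed=0):
    """Alternate attacker exploration with victim safety fine-tuning,
    as in the Active Attacks loop described above."""
    random.seed(seed)
    victim_defense = set()
    for _ in range(rounds):
        mode = attacker_round(victim_defense)
        if mode is None:
            break
        # "Safety fine-tune" the victim on the discovered mode, which
        # suppresses its reward and forces exploration next round.
        victim_defense.add(mode)
    return victim_defense

print(sorted(active_attacks()))  # after 5 rounds, all 5 modes are covered
```

The key design point the sketch mirrors is that no diversity bonus is given to the attacker itself; coverage emerges because the environment (the victim) changes under the attacker's feet, one exploited mode at a time.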