🤖 AI Summary
Existing red-teaming approaches suffer from insufficient prompt diversity and premature convergence to local attack patterns. Method: We propose Active Attacks, an adaptive reinforcement learning framework (compatible with PPO, REINFORCE, or GFlowNets) that uses a toxicity classifier as the reward signal and introduces a dynamic victim-model evolution mechanism: the victim model is periodically safety fine-tuned on collected attack prompts, which suppresses rewards in already-discovered attack regions. This drives continual exploration of novel vulnerabilities and naturally induces an easy-to-hard curriculum. Contribution/Results: Experiments show that, with only a 6% increase in computational overhead, the cross-attack success rate improves from 0.07% to 31.28%, a relative gain of more than 400×, substantially outperforming state-of-the-art RL-based red-teaming methods.
📝 Abstract
We address the challenge of generating diverse attack prompts for large language models (LLMs) that elicit harmful behaviors (e.g., insults, sexual content) and are used for safety fine-tuning. Rather than relying on manual prompt engineering, attacker LLMs can be trained with reinforcement learning (RL) to automatically generate such prompts using only a toxicity classifier as a reward. However, capturing a wide range of harmful behaviors is a significant challenge that requires explicit diversity objectives. Existing diversity-seeking RL methods often collapse to limited modes: once high-reward prompts are found, exploration of new regions is discouraged. Inspired by the active learning paradigm that encourages adaptive exploration, we introduce *Active Attacks*, a novel RL-based red-teaming algorithm that adapts its attacks as the victim evolves. By periodically safety fine-tuning the victim LLM with collected attack prompts, rewards in exploited regions diminish, which forces the attacker to seek unexplored vulnerabilities. This process naturally induces an easy-to-hard exploration curriculum, where the attacker progresses beyond easy modes toward increasingly difficult ones. As a result, Active Attacks uncovers a wide range of local attack modes step by step, and their combination achieves wide coverage of the multi-mode distribution. Active Attacks, a simple plug-and-play module that seamlessly integrates into existing RL objectives, unexpectedly outperformed prior RL-based methods, including GFlowNets, PPO, and REINFORCE, by improving cross-attack success rates against GFlowNets, the previous state-of-the-art, from 0.07% to 31.28% (a relative gain greater than 400×) with only a 6% increase in computation. Our code is publicly available [here](https://github.com/dbsxodud-11/active_attacks).
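To make the abstract's loop concrete, here is a minimal, purely illustrative sketch. All names (`MODES`, `reward`, `attacker_round`) are invented stand-ins: discrete "attack modes" replace prompt regions, a set replaces the victim's safety fine-tuning, and a greedy search replaces the actual attacker LLM trained with PPO/REINFORCE/GFlowNets against a toxicity classifier. It only shows the dynamic the paper describes: each round the attacker collapses to one high-reward mode, the victim is patched on it, the reward there vanishes, and the attacker is pushed to the next mode until coverage is complete.

```python
import random

# Hypothetical attack regions; in the real method these are regions
# of prompt space, not a fixed list of categories.
MODES = ["insult", "sexual", "violence", "fraud", "malware"]

def reward(mode, victim_defense):
    """Toxicity-classifier stand-in: high reward while the victim is
    still vulnerable to this mode, zero once it has been patched."""
    return 0.0 if mode in victim_defense else 1.0

def attacker_round(victim_defense):
    """RL-attacker stand-in: mode-collapses onto the first
    high-reward region it stumbles into this round."""
    for mode in random.sample(MODES, len(MODES)):
        if reward(mode, victim_defense) > 0:
            return mode
    return None  # every mode already defended

def active_attacks(rounds=5, seed=0):
    """Alternate attacker exploration with victim safety fine-tuning,
    as in the Active Attacks loop described above."""
    random.seed(seed)
    victim_defense = set()
    for _ in range(rounds):
        mode = attacker_round(victim_defense)
        if mode is None:
            break
        # "Safety fine-tune" the victim on the discovered mode, which
        # suppresses its reward and forces exploration next round.
        victim_defense.add(mode)
    return victim_defense

print(sorted(active_attacks()))  # after 5 rounds, all 5 modes are covered
```

The key design point the sketch mirrors is that no diversity bonus is given to the attacker itself; coverage emerges because the environment (the victim) changes under the attacker's feet, one exploited mode at a time.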