Adaptive Instruction Composition for Automated LLM Red-Teaming

📅 2026-04-22

🤖 AI Summary

Current red-teaming approaches for large language models struggle to generate adversarial prompts that are both effective and diverse: random composition strategies discover varied attacks but with limited effectiveness, while trial-and-error methods find successes only within a semantically narrow range. To address this, the paper proposes Adaptive Instruction Composition, a framework that feeds contrastive embeddings into a lightweight neural contextual bandit. Within a large combinatorial space of crowdsourced queries and tactics, the bandit uses reinforcement learning to balance exploration and exploitation, composing attacker instructions that are diverse and tailored to the target's vulnerabilities. The authors report that the method substantially outperforms random combination on a set of effectiveness and diversity metrics, even under model transfer, and that it surpasses a host of recent adaptive approaches on HarmBench.

📝 Abstract

Many approaches to LLM red-teaming leverage an attacker LLM to discover jailbreaks against a target. Several of them task the attacker with identifying effective strategies through trial and error, resulting in a semantically limited range of successes. Another approach discovers diverse attacks by combining crowdsourced harmful queries and tactics into instructions for the attacker, but does so at random, limiting effectiveness. This article introduces a novel framework, Adaptive Instruction Composition, that combines crowdsourced texts according to an adaptive mechanism trained to jointly optimize effectiveness with diversity. We use reinforcement learning to balance exploration with exploitation in a combinatorial space of instructions to guide the attacker toward diverse generations tailored to target vulnerabilities. We demonstrate that our approach substantially outperforms random combination on a set of effectiveness and diversity metrics, even under model transfer. Further, we show that it surpasses a host of recent adaptive approaches on HarmBench. We employ a lightweight neural contextual bandit that adapts to contrastive embedding inputs, and provide ablations suggesting that the contrastive pretraining enables the network to rapidly generalize and scale to the massive space as it learns.
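
The abstract does not give the bandit's architecture or training details, so the following is only a generic sketch of the exploration/exploitation loop a contextual bandit runs over embedding vectors. It swaps in a linear reward model with epsilon-greedy selection in place of the paper's neural bandit, uses random synthetic vectors in place of contrastive embeddings, and uses a synthetic scalar reward. It does not model the diversity objective. All names (`LinearEpsilonGreedyBandit`, `run_demo`, `true_w`) are hypothetical and not taken from the paper.

```python
import random


class LinearEpsilonGreedyBandit:
    """Illustrative stand-in for a contextual bandit.

    Assumption: a linear reward model trained by SGD with epsilon-greedy
    exploration. This is NOT the paper's neural architecture.
    """

    def __init__(self, dim, epsilon=0.1, lr=0.05, seed=0):
        self.weights = [0.0] * dim
        self.epsilon = epsilon
        self.lr = lr
        self.rng = random.Random(seed)

    def predict(self, context):
        # Estimated reward for one candidate's embedding vector.
        return sum(w * x for w, x in zip(self.weights, context))

    def select(self, contexts):
        # Explore: with probability epsilon, pick a random candidate.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(contexts))
        # Exploit: otherwise pick the candidate with the highest estimate.
        scores = [self.predict(c) for c in contexts]
        return max(range(len(contexts)), key=scores.__getitem__)

    def update(self, context, reward):
        # One SGD step on squared error for the chosen candidate only.
        error = reward - self.predict(context)
        self.weights = [
            w + self.lr * error * x for w, x in zip(self.weights, context)
        ]


def run_demo(rounds=2000, n_candidates=8, seed=0):
    rng = random.Random(seed)
    true_w = [1.0, -0.5, 0.25, 0.0]  # hidden synthetic reward direction
    dim = len(true_w)
    bandit = LinearEpsilonGreedyBandit(dim, seed=seed)
    for _ in range(rounds):
        # Synthetic stand-ins for candidate embeddings.
        contexts = [
            [rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for _ in range(n_candidates)
        ]
        chosen = bandit.select(contexts)
        # Synthetic noisy reward; only the chosen candidate is observed.
        reward = sum(w * x for w, x in zip(true_w, contexts[chosen]))
        reward += rng.gauss(0.0, 0.05)
        bandit.update(contexts[chosen], reward)
    return bandit, true_w


if __name__ == "__main__":
    bandit, true_w = run_demo()
    print("learned:", [round(w, 2) for w in bandit.weights])
    print("target: ", true_w)
```

The point of the sketch is the feedback structure: only the selected candidate's reward is observed each round, so the learner has to occasionally try candidates it currently rates poorly in order to keep its estimates accurate across the whole space.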

Problem

Research questions and friction points this paper is trying to address.

LLM red-teaming

instruction composition

attack diversity

jailbreak

adversarial attacks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Instruction Composition

LLM Red-Teaming

Reinforcement Learning