AgenticRed: Optimizing Agentic Systems for Automated Red-teaming

📅 2026-01-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limitations of existing automated red-teaming approaches, which are constrained by manually designed workflows and struggle to efficiently explore the system design space. The authors propose a novel framework that, for the first time, formulates red-teaming as an agent-based system design problem. By integrating in-context learning with large language models and an evolutionary selection mechanism, the framework enables end-to-end autonomous evolution of red-team system architectures without human intervention. This approach moves beyond the conventional paradigm of optimizing attack strategies within fixed structures. Empirical results are strong: 96% and 98% attack success rates on Llama-2-7B and Llama-3-8B, respectively, and 100% success on both GPT-3.5-Turbo and GPT-4o, significantly outperforming prior methods while exhibiting strong cross-model transferability.

📝 Abstract
While recent automated red-teaming methods show promise for systematically exposing model vulnerabilities, most existing approaches rely on human-specified workflows. This dependence on manually designed workflows suffers from human biases and makes exploring the broader design space expensive. We introduce AgenticRed, an automated pipeline that leverages LLMs' in-context learning to iteratively design and refine red-teaming systems without human intervention. Rather than optimizing attacker policies within predefined structures, AgenticRed treats red-teaming as a system design problem. Inspired by methods like Meta Agent Search, we develop a novel procedure for evolving agentic systems using evolutionary selection, and apply it to the problem of automatic red-teaming. Red-teaming systems designed by AgenticRed consistently outperform state-of-the-art approaches, achieving 96% attack success rate (ASR) on Llama-2-7B (a 36% improvement) and 98% on Llama-3-8B on HarmBench. Our approach exhibits strong transferability to proprietary models, achieving 100% ASR on GPT-3.5-Turbo and GPT-4o, and 60% on Claude-Sonnet-3.5 (a 24% improvement). This work highlights automated system design as a powerful paradigm for AI safety evaluation that can keep pace with rapidly evolving models.
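The loop the abstract describes — an LLM proposing candidate red-team system designs, each scored by attack success rate, with evolutionary selection deciding which designs seed the next round — can be sketched as follows. This is an illustrative sketch only, not the paper's implementation: `propose_variant` and `evaluate_asr` are hypothetical stubs standing in for the LLM design step and the HarmBench-style evaluation, and all parameter names are assumptions.

```python
def propose_variant(design: str) -> str:
    # Stub for the LLM's in-context design step. In the described
    # pipeline, an LLM would be prompted with an archive of prior
    # designs and their scores and asked to propose a refinement.
    return design + "+mutation"

def evaluate_asr(design: str) -> float:
    # Stub fitness function. The real signal would be the candidate
    # system's attack success rate against a target model; here we
    # just reward accumulated refinements, capped at 1.0.
    return min(1.0, 0.1 * design.count("+mutation"))

def evolve(seed_designs, generations=5, population=4, top_k=2):
    """Evolutionary selection over red-team system designs (sketch)."""
    archive = [(evaluate_asr(d), d) for d in seed_designs]
    for _ in range(generations):
        # Select the top-k designs discovered so far as parents.
        parents = sorted(archive, reverse=True)[:top_k]
        # Each parent contributes an equal share of the new population.
        children = [propose_variant(d)
                    for _, d in parents
                    for _ in range(population // top_k)]
        archive += [(evaluate_asr(c), c) for c in children]
    return max(archive)  # best (asr, design) pair found

best_asr, best_design = evolve(["baseline-workflow"])
```

With the toy fitness above, each generation stacks one more refinement onto the best parents, so the archive's best score climbs monotonically — a stand-in for the paper's claim that system-level evolution, rather than attack optimization inside a fixed structure, drives the ASR gains.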
Problem

Research questions and friction points this paper is trying to address.

automated red-teaming
human bias
design space exploration
AI safety evaluation
agentic systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

AgenticRed
automated red-teaming
evolutionary system design
in-context learning
AI safety evaluation