🤖 AI Summary
Existing LLM red-teaming methods rely predominantly on static prompt templates or single-turn attacks, failing to capture the dynamic, strategic nature of realistic adversarial dialogues. This work reformulates red teaming as a Markov Decision Process (MDP) and introduces a generative agent framework grounded in hierarchical reinforcement learning, enabling multi-turn, long-horizon, strategically adaptive attacks. It further proposes a token-level harmfulness reward function that provides fine-grained guidance for optimizing attack trajectories. The approach overcomes fundamental limitations of template- and single-turn-based methods, achieving state-of-the-art performance across multiple benchmarks. It substantially improves vulnerability discovery, particularly of complex, latent, and deeply embedded security risks, thereby strengthening the depth, rigor, and practical effectiveness of LLM safety evaluation.
📝 Abstract
Red teaming is critical for identifying vulnerabilities in, and building trust in, current Large Language Models (LLMs). However, existing automated methods rely on brittle prompt templates or single-turn attacks, failing to capture the complex, interactive nature of real-world adversarial dialogues. We propose a novel paradigm: training an AI to strategically "break" another AI. By formalizing red teaming as a Markov Decision Process (MDP) and employing a hierarchical Reinforcement Learning (RL) framework, we effectively address the inherent sparse-reward and long-horizon challenges. Our generative agent learns coherent, multi-turn attack strategies through a fine-grained, token-level harm reward, enabling it to uncover subtle vulnerabilities missed by existing baselines. This approach sets a new state of the art, fundamentally reframing LLM red teaming as a dynamic, trajectory-based process (rather than a one-step test) essential for robust AI deployment.
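To make the MDP framing concrete, the sketch below shows a minimal multi-turn red-teaming episode in Python: the state is the dialogue history, the attacker policy chooses a prompt each turn (the action), the target model's response drives the transition, and the reward is a per-token sum of harmfulness scores. This is an illustrative toy, not the paper's implementation; the `policy`, `target`, and `scorer` callables are hypothetical stand-ins for the learned hierarchical attacker, the victim LLM, and the token-level harmfulness model.

```python
from dataclasses import dataclass, field


@dataclass
class RedTeamState:
    """MDP state: the dialogue history of alternating attacker/target turns."""
    history: list = field(default_factory=list)


def token_level_harm_reward(response_tokens, harm_scores):
    """Fine-grained reward: sum of per-token harmfulness scores.

    In the paper's setting the scores would come from a learned
    harmfulness model; here they are stand-in floats in [0, 1].
    """
    assert len(response_tokens) == len(harm_scores)
    return sum(harm_scores)


def rollout(policy, target, scorer, max_turns=3):
    """One episode: attacker acts, target responds, reward accumulates."""
    state = RedTeamState()
    total_reward = 0.0
    for _ in range(max_turns):
        prompt = policy(state.history)      # attacker action
        response = target(prompt)           # environment transition
        tokens = response.split()           # crude token stand-in
        total_reward += token_level_harm_reward(tokens, scorer(tokens))
        state.history.extend([prompt, response])
    return total_reward, state


# Toy stand-ins: fixed attacker prompt, echoing target, uniform harm scores.
demo_policy = lambda history: "probe"
demo_target = lambda prompt: "a b"
demo_scorer = lambda tokens: [0.1 for _ in tokens]

total, final_state = rollout(demo_policy, demo_target, demo_scorer)
```

A dense per-token reward like this is what mitigates the sparse-reward problem: the agent gets a learning signal every turn rather than only at episode end.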