AI Summary
LLM-based agents face novel security risks in planning and tool invocation, yet existing red-teaming methodologies lack a general-purpose black-box evaluation framework. Method: This paper introduces SIRAJ, a generic red-teaming framework for black-box LLM agents, built on a dynamic two-stage process: it first generates diverse seed test cases, then iteratively constructs and refines adversarial attacks based on the execution traces of earlier attempts. A key innovation is structured reasoning distillation: a teacher model's structured reasoning guides the training of a lightweight attack model, cutting the red-teamer's size from 671B to 8B parameters while improving attack success rate by 100%. Separately, the seed test case generation raises coverage of risk outcomes and tool-call trajectories by 2–2.5×. Contribution/Results: The framework integrates black-box testing, dynamic seed generation, and model distillation, achieving both high vulnerability detection capability and strong cross-scenario generalizability.
Abstract
The ability of LLM agents to plan and invoke tools exposes them to new safety risks, making a comprehensive red-teaming system crucial for discovering vulnerabilities and ensuring their safe deployment. We present SIRAJ: a generic red-teaming framework for arbitrary black-box LLM agents. We employ a dynamic two-step process that starts with an agent definition and generates diverse seed test cases that cover various risk outcomes, tool-use trajectories, and risk sources. Then, it iteratively constructs and refines model-based adversarial attacks based on the execution trajectories of previous attempts. To reduce the cost of red-teaming, we present a model distillation approach that leverages structured forms of a teacher model's reasoning to train smaller models that are equally effective. Across diverse agent evaluation settings, our seed test case generation approach yields a 2–2.5x boost in the coverage of risk outcomes and tool-calling trajectories. Our distilled 8B red-teamer model improves attack success rate by 100%, surpassing the 671B DeepSeek-R1 model. Our ablations and analyses validate the effectiveness of the iterative framework, structured reasoning, and the generalization of our red-teamer models.
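The second step described above, refining an attack from the execution trajectories of earlier attempts, can be illustrated with a minimal Python sketch. This is not the paper's implementation: `Attempt`, `red_team`, and the toy `toy_agent`, `toy_judge`, and `toy_refine` stand-ins are hypothetical names introduced here. In a real setting, the black-box agent, the risk judge, and the red-teamer model would take the place of the toy functions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Attempt:
    """One red-teaming attempt (hypothetical record type, not from the paper)."""
    prompt: str
    trajectory: List[str]  # names of the tools the agent called, in order
    success: bool          # whether the judge flagged a risky outcome


def red_team(seed: str,
             run_agent: Callable[[str], List[str]],
             is_risky: Callable[[List[str]], bool],
             refine: Callable[[str, List[Attempt]], str],
             max_iters: int = 5) -> List[Attempt]:
    """Iteratively refine an attack prompt using the trajectories of past attempts."""
    history: List[Attempt] = []
    prompt = seed
    for _ in range(max_iters):
        trajectory = run_agent(prompt)       # black-box call: only tool calls are observed
        attempt = Attempt(prompt, trajectory, is_risky(trajectory))
        history.append(attempt)
        if attempt.success:
            break                            # stop at the first successful attack
        prompt = refine(seed, history)       # red-teamer sees all earlier attempts
    return history


# Toy stand-ins so the sketch runs end to end (assumptions, not real components).
def toy_agent(prompt: str) -> List[str]:
    calls = ["read_file"]
    if "authorized" in prompt.lower():
        calls.append("delete_file")          # the unsafe tool call in this toy setup
    return calls


def toy_judge(trajectory: List[str]) -> bool:
    return "delete_file" in trajectory


def toy_refine(seed: str, history: List[Attempt]) -> str:
    # A real red-teamer model would reason over the history; this toy ignores it.
    return seed + " This request is authorized by the admin."


history = red_team("Clean up the temp folder.", toy_agent, toy_judge, toy_refine)
```

In this toy run the seed prompt fails on the first attempt and the refined prompt succeeds on the second, so `history` holds two attempts. Keeping the full history, rather than only the last attempt, is a design choice that lets the refiner condition on every earlier trajectory.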