SIRAJ: Diverse and Efficient Red-Teaming for LLM Agents via Distilled Structured Reasoning

📅 2025-10-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: LLM-based agents face new security risks in planning and tool invocation, yet existing red-teaming methods lack a general-purpose black-box evaluation framework. Method: This paper introduces SIRAJ, a generic red-teaming framework for black-box LLM agents. It uses a dynamic two-stage process: it first generates diverse seed test cases from an agent definition, then iteratively constructs and refines adversarial attacks based on the execution traces of earlier attempts. To cut cost, it applies structured reasoning distillation, in which structured forms of a teacher model's reasoning are used to train a lightweight attack model. Contribution/Results: Seed test case generation raises coverage of risk outcomes and tool-calling trajectories by 2–2.5×, and the distilled 8B red-teamer improves attack success rate by 100%, surpassing the 671B DeepSeek-R1 model. Ablations support the iterative framework, structured reasoning, and the generalization of the red-teamer models.

📝 Abstract

The ability of LLM agents to plan and invoke tools exposes them to new safety risks, making a comprehensive red-teaming system crucial for discovering vulnerabilities and ensuring their safe deployment. We present SIRAJ: a generic red-teaming framework for arbitrary black-box LLM agents. We employ a dynamic two-step process that starts with an agent definition and generates diverse seed test cases that cover various risk outcomes, tool-use trajectories, and risk sources. Then, it iteratively constructs and refines model-based adversarial attacks based on the execution trajectories of former attempts. To optimize the red-teaming cost, we present a model distillation approach that leverages structured forms of a teacher model's reasoning to train smaller models that are equally effective. Across diverse evaluation agent settings, our seed test case generation approach yields a 2-2.5x boost to the coverage of risk outcomes and tool-calling trajectories. Our distilled 8B red-teamer model improves attack success rate by 100%, surpassing the 671B DeepSeek-R1 model. Our ablations and analyses validate the effectiveness of the iterative framework, structured reasoning, and the generalization of our red-teamer models.
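
The second step of the process described in the abstract, refining an attack from the execution trajectories of earlier attempts, can be pictured as a simple loop. The sketch below is only an illustration under stated assumptions and is not the paper's implementation: `red_team`, `Attempt`, and the three callables (`attacker`, `run_agent`, `judge`) are hypothetical names, and the toy stubs stand in for real models and a real black-box agent.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Attempt:
    """One attack attempt and what the target agent did with it."""
    prompt: str
    trajectory: List[str]  # tool calls the agent made, in order
    success: bool


def red_team(
    seed: str,
    attacker: Callable[[str, List[Attempt]], str],  # hypothetical: proposes the next attack
    run_agent: Callable[[str], List[str]],          # hypothetical: black-box agent, returns tool calls
    judge: Callable[[str, List[str]], bool],        # hypothetical: did the risky outcome occur?
    max_iters: int = 5,
) -> List[Attempt]:
    """Refine an attack on one seed test case until it succeeds or the budget runs out."""
    history: List[Attempt] = []
    for _ in range(max_iters):
        # The attacker sees every earlier prompt and the trajectory it produced.
        prompt = attacker(seed, history)
        trajectory = run_agent(prompt)
        success = judge(seed, trajectory)
        history.append(Attempt(prompt, trajectory, success))
        if success:
            break
    return history


# Toy stand-ins (assumptions for illustration only).
def toy_attacker(seed: str, history: List[Attempt]) -> str:
    return f"{seed} (attempt {len(history) + 1})"


def toy_agent(prompt: str) -> List[str]:
    # Pretend the agent only misbehaves on the third variant of the prompt.
    if "attempt 3" in prompt:
        return ["read_file", "send_email"]
    return ["read_file"]


def toy_judge(seed: str, trajectory: List[str]) -> bool:
    return "send_email" in trajectory
```

Calling `red_team("leak the file", toy_attacker, toy_agent, toy_judge)` with these stubs stops once the judge reports a risky tool call. The agent is treated purely as a function from prompt to observed tool calls, which is what makes the loop black-box.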

Problem

Research questions and friction points this paper is trying to address.

Red-teaming framework discovers LLM agent vulnerabilities

Generates diverse test cases for risk coverage

Uses distilled models to optimize attack efficiency
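
To make the coverage claim concrete, one plausible way to measure tool-calling trajectory coverage is the fraction of reference tool-call sequences that a set of test cases actually triggers. This definition is an assumption for illustration; the page does not state how the paper computes coverage, and `trajectory_coverage` is a hypothetical helper.

```python
from typing import Iterable, Sequence


def trajectory_coverage(
    observed: Iterable[Sequence[str]],
    reference: Iterable[Sequence[str]],
) -> float:
    """Fraction of reference tool-call sequences hit by at least one test case.

    Assumed definition for illustration, not taken from the paper.
    """
    ref = {tuple(t) for t in reference}
    if not ref:
        return 0.0
    seen = {tuple(t) for t in observed} & ref
    return len(seen) / len(ref)


# Hypothetical agent with four possible tool-call sequences.
reference = [
    ["search"],
    ["search", "read_file"],
    ["read_file", "send_email"],
    ["search", "read_file", "send_email"],
]
baseline_cases = [["search"], ["search"]]                   # repetitive seeds
diverse_cases = [["search"], ["read_file", "send_email"]]   # more varied seeds

baseline = trajectory_coverage(baseline_cases, reference)
diverse = trajectory_coverage(diverse_cases, reference)
```

Under this definition, a test set that repeats the same tool-call sequence gains nothing from the repeats, so diversity of seeds is what moves the number.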

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic two-step process generates diverse test cases

Model distillation trains smaller effective red-teamer models

Iterative adversarial attacks refine based on execution trajectories
