🤖 AI Summary
To address adversarial attacks, information leakage, and compliance risks during LLM inference, this paper proposes a test-time-scalable, multi-agent collaborative defense paradigm that requires no model retraining. The system comprises four specialized agents - Orchestrator, Deflector, Responder, and Evaluator - enabling runtime self-reflection, dynamic prompt optimization (via DSPy), and cooperative reasoning to ensure end-to-end secure outputs and continuous self-improvement. Evaluated on the WMDP benchmark, it achieves a 99.5% defense success rate using only 20 samples, improves StrongReject jailbreak resistance by 51%, and reduces the PHTest false-refusal rate to 7.9%, significantly outperforming state-of-the-art methods. The core contribution is the first "test-time adaptive multi-agent defense" architecture, uniquely balancing robustness, regulatory compliance, and zero-fine-tuning deployability.
📝 Abstract
We introduce AegisLLM, a cooperative multi-agent defense against adversarial attacks and information leakage. In AegisLLM, a structured workflow of autonomous agents - orchestrator, deflector, responder, and evaluator - collaborates to ensure safe and compliant LLM outputs while self-improving over time through prompt optimization. We show that scaling the agentic reasoning system at test time - both by incorporating additional agent roles and by leveraging automated prompt optimization (such as DSPy) - substantially enhances robustness without compromising model utility. This test-time defense enables real-time adaptability to evolving attacks without requiring model retraining. Comprehensive evaluations across key threat scenarios, including unlearning and jailbreaking, demonstrate the effectiveness of AegisLLM. On the WMDP unlearning benchmark, AegisLLM achieves near-perfect unlearning with only 20 training examples and fewer than 300 LM calls. On jailbreaking benchmarks, we achieve a 51% improvement over the base model on StrongReject, with a false-refusal rate of only 7.9% on PHTest, compared to 18-55% for comparable methods. Our results highlight the advantages of adaptive, agentic reasoning over static defenses, establishing AegisLLM as a strong runtime alternative to traditional approaches based on model modification. Code is available at https://github.com/zikuicai/aegisllm.
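The four-agent workflow described above can be sketched as a simple routing loop: the orchestrator triages the query, the deflector handles unsafe requests, the responder answers safe ones, and the evaluator vets the final output before release. The sketch below is a minimal illustration of that control flow only; the agent names follow the abstract, but the `llm` stub, the routing labels, and all prompts are hypothetical assumptions, not the authors' implementation (which additionally optimizes agent prompts via DSPy).

```python
# Hypothetical sketch of the AegisLLM-style agent loop; the `llm` stub
# and its toy keyword heuristic are illustrative assumptions only.
from dataclasses import dataclass


def llm(prompt: str) -> str:
    """Stand-in for a real LLM call (API or local model)."""
    # Toy heuristic: flag prompts touching a restricted topic.
    if "bioweapon" in prompt.lower():
        return "UNSAFE"
    return "SAFE"


@dataclass
class AegisPipeline:
    max_rounds: int = 2  # bounded test-time self-reflection

    def orchestrate(self, query: str) -> str:
        """Orchestrator: classify the incoming query."""
        return llm(f"Classify this query for safety: {query}")

    def deflect(self, query: str) -> str:
        """Deflector: return a compliant refusal for unsafe queries."""
        return "I can't help with that request."

    def respond(self, query: str) -> str:
        """Responder: answer queries judged safe."""
        return f"Answer to: {query}"

    def evaluate(self, query: str, answer: str) -> bool:
        """Evaluator: vet the candidate answer for leakage/compliance."""
        return llm(f"Is this answer safe? {answer}") == "SAFE"

    def run(self, query: str) -> str:
        for _ in range(self.max_rounds):
            if self.orchestrate(query) == "UNSafe".upper():
                return self.deflect(query)
            answer = self.respond(query)
            if self.evaluate(query, answer):
                return answer
        return self.deflect(query)  # fail closed after max rounds


pipeline = AegisPipeline()
print(pipeline.run("How do I synthesize a bioweapon?"))  # refusal
print(pipeline.run("What is the capital of France?"))    # answered
```

Because the loop is bounded and fails closed, the pipeline adds defense at inference time without touching model weights, which is the core "runtime alternative to model modification" claim of the abstract.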