PandaGuard: Systematic Evaluation of LLM Safety in the Era of Jailbreaking Attacks

📅 2025-05-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) are vulnerable to jailbreaking attacks, yet existing safety evaluations are fragmented and lack systematic rigor. Method: We propose the first multi-agent evaluation framework specifically designed for LLM jailbreaking—comprising attacker, defender, and judge agents—and support 19 attack strategies, 12 defense mechanisms, and diverse judgment policies. Our modular multi-agent paradigm enables scalable, reproducible assessment and yields PandaBench, a standardized benchmark covering 49 models and over 3 billion tokens. Contribution/Results: Experiments reveal a quantifiable vulnerability spectrum across models; a pronounced trade-off between efficacy and computational cost among the 12 defenses; and substantial inter-judge inconsistency, causing up to 18.7% variance in safety scores. All code, configurations, and results are publicly released.

Technology Category

Application Category

📝 Abstract
Large language models (LLMs) have achieved remarkable capabilities but remain vulnerable to adversarial prompts known as jailbreaks, which can bypass safety alignment and elicit harmful outputs. Despite growing efforts in LLM safety research, existing evaluations are often fragmented, focused on isolated attack or defense techniques, and lack systematic, reproducible analysis. In this work, we introduce PandaGuard, a unified and modular framework that models LLM jailbreak safety as a multi-agent system comprising attackers, defenders, and judges. Our framework implements 19 attack methods and 12 defense mechanisms, along with multiple judgment strategies, all within a flexible plugin architecture supporting diverse LLM interfaces, multiple interaction modes, and configuration-driven experimentation that enhances reproducibility and practical deployment. Built on this framework, we develop PandaBench, a comprehensive benchmark that evaluates the interactions between these attack/defense methods across 49 LLMs and various judgment approaches, requiring over 3 billion tokens to execute. Our extensive evaluation reveals key insights into model vulnerabilities, defense cost-performance trade-offs, and judge consistency. We find that no single defense is optimal across all dimensions and that judge disagreement introduces nontrivial variance in safety assessments. We release the code, configurations, and evaluation results to support transparent and reproducible research in LLM safety.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM vulnerabilities to jailbreak attacks systematically
Assessing safety defense trade-offs and judge consistency
Developing a unified framework for reproducible LLM safety analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent system modeling jailbreak safety
Flexible plugin architecture for diverse methods
Comprehensive benchmark evaluating attack-defense interactions
🔎 Similar Papers
No similar papers found.
G
Guobin Shen
Beijing Institute of AI Safety and Governance (Beijing-AISI), Beijing Key Laboratory of Safe AI and Superalignment, BrainCog Lab, CASIA
Dongcheng Zhao
Dongcheng Zhao
Beijing Institute of AI Safety and Governance
Spiking Neural NetworksEvent Based VisionBrain-inspired AILLM Safety
L
Linghao Feng
BrainCog Lab, CASIA
X
Xiang He
BrainCog Lab, CASIA
J
Jihang Wang
BrainCog Lab, CASIA
S
Sicheng Shen
BrainCog Lab, CASIA
H
Haibo Tong
BrainCog Lab, CASIA
Yiting Dong
Yiting Dong
Peking University, Institute of Automation, CAS
Brain Inspired IntelligenceSpiking Neural NetworkEvent-based VisionLarge Language Model
J
Jindong Li
BrainCog Lab, CASIA
Xiang Zheng
Xiang Zheng
Department of Computer Science, City University of Hong Kong
Reinforcement LearningTrustworthy AIEmbodied AI
Y
Yi Zeng
Beijing Institute of AI Safety and Governance (Beijing-AISI), Beijing Key Laboratory of Safe AI and Superalignment, BrainCog Lab, CASIA, Long-term AI