Efficient LLM Safety Evaluation through Multi-Agent Debate

📅 2025-11-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Evaluating the safety of large language models (LLMs) is costly and poorly scalable. Method: This paper proposes a multi-agent debate framework powered by small language models (SLMs), comprising Critic, Defender, and Judge agents that engage in three-round structured interaction with value-aligned reasoning to detect jailbreak attack semantics at fine granularity. To support evaluation, we introduce HAJailBench—the first large-scale, human-annotated jailbreak benchmark—featuring expert-level, fine-grained annotations for assessing safety judgments and Judge reliability. Contribution/Results: Experiments show our framework achieves judgment consistency comparable to GPT-4o on HAJailBench while reducing inference cost by over 90%, significantly improving evaluation efficiency, accuracy, and scalability. This work provides the first empirical validation of lightweight multi-agent debate as an effective and robust approach for LLM safety assessment.

Technology Category

Application Category

📝 Abstract
Safety evaluation of large language models (LLMs) increasingly relies on LLM-as-a-Judge frameworks, but the high cost of frontier models limits scalability. We propose a cost-efficient multi-agent judging framework that employs Small Language Models (SLMs) through structured debates among critic, defender, and judge agents. To rigorously assess safety judgments, we construct HAJailBench, a large-scale human-annotated jailbreak benchmark comprising 12,000 adversarial interactions across diverse attack methods and target models. The dataset provides fine-grained, expert-labeled ground truth for evaluating both safety robustness and judge reliability. Our SLM-based framework achieves agreement comparable to GPT-4o judges on HAJailBench while substantially reducing inference cost. Ablation results show that three rounds of debate yield the optimal balance between accuracy and efficiency. These findings demonstrate that structured, value-aligned debate enables SLMs to capture semantic nuances of jailbreak attacks and that HAJailBench offers a reliable foundation for scalable LLM safety evaluation.
Problem

Research questions and friction points this paper is trying to address.

Reducing high costs of frontier LLMs in safety evaluation
Developing scalable jailbreak benchmark with expert annotations
Optimizing multi-agent debate structure for safety judgment accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent debate framework using Small Language Models
HAJailBench benchmark with 12,000 human-annotated jailbreak interactions
Three-round debate structure balancing accuracy and cost efficiency
D
Dachuan Lin
Beijing Institute of AI Safety and Governance, China; Beijing Key Laboratory of Safe AI and Super Alignment, China; Institute of Automation, Chinese Academy of Sciences, China; CSE, The Chinese University of Hong Kong, Hong Kong SAR, China
G
Guobin Shen
Institute of Automation, Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China
Zihao Yang
Zihao Yang
New York University
Natural Language Processing
T
Tianrong Liu
Department of Mathematics, The Chinese University of Hong Kong, Hong Kong SAR, China
Dongcheng Zhao
Dongcheng Zhao
Beijing Institute of AI Safety and Governance
Spiking Neural NetworksEvent Based VisionBrain-inspired AILLM Safety
Y
Yi Zeng
Beijing Institute of AI Safety and Governance, China; Beijing Key Laboratory of Safe AI and Super Alignment, China; Institute of Automation, Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China