Evaluating Multi-Agent Defences Against Jailbreaking Attacks on Large Language Models

📅 2025-06-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the insufficient robustness of large language models (LLMs) against jailbreaking attacks by proposing and systematically evaluating a multi-agent collaborative defence mechanism. The authors extend the AutoDefense framework with dual- and tri-agent architectures, integrating state-of-the-art jailbreaking strategies, including BetterDan and JB, for adversarial evaluation. Results demonstrate that multi-agent systems significantly reduce false negative rates (i.e., missed detections of malicious prompts), outperforming single-agent baselines in jailbreak resistance. However, this improvement incurs increased false positive rates and higher inference overhead, exposing an inherent trade-off among security, usability, and efficiency. To the authors' knowledge, this is the first study to quantitatively validate the efficacy of the multi-agent paradigm for LLM safety defence, precisely characterizing its performance boundaries across diverse attack types. The findings establish a scalable, principled pathway toward enhancing LLM robustness against adversarial prompt engineering.
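
In AutoDefense-style pipelines, the defence filters a candidate response through cooperating analysis agents before it is released to the user. Below is a minimal sketch of that tri-agent response-filtering pattern; the agent roles, prompts, and the `chat` helper are illustrative assumptions, not the framework's actual API.

```python
# Minimal sketch of a tri-agent response filter in the spirit of AutoDefense.
# The roles, prompts, and the `chat` callable are illustrative assumptions.
from typing import Callable

Chat = Callable[[str], str]  # wraps one LLM call: prompt -> completion

def intention_agent(chat: Chat, response: str) -> str:
    """Agent 1: summarise the apparent intention behind the response."""
    return chat(f"Describe the intention of this response:\n{response}")

def analysis_agent(chat: Chat, response: str, intention: str) -> str:
    """Agent 2: analyse whether the response fulfils a harmful request."""
    return chat(
        "Given the stated intention, does this response help with a "
        f"harmful or disallowed task?\nIntention: {intention}\n"
        f"Response: {response}"
    )

def judge_agent(chat: Chat, analysis: str) -> bool:
    """Agent 3: return True if the response should be blocked."""
    verdict = chat(f"Answer BLOCK or ALLOW.\nAnalysis: {analysis}")
    return "BLOCK" in verdict.upper()

def defend(chat: Chat, response: str) -> str:
    """Tri-agent pipeline: release the response only if the judge allows it."""
    intention = intention_agent(chat, response)
    analysis = analysis_agent(chat, response, intention)
    if judge_agent(chat, analysis):
        return "I'm sorry, but I can't help with that."
    return response
```

A dual-agent configuration would merge the intention and analysis steps into a single agent; the paper compares both against a single-agent baseline.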

📝 Abstract
Recent advances in large language models (LLMs) have raised concerns about jailbreaking attacks, i.e., prompts that bypass safety mechanisms. This paper investigates the use of multi-agent LLM systems as a defence against such attacks. We evaluate three jailbreaking strategies: the original AutoDefense attack and two from Deepleaps, BetterDan and JB. Reproducing the AutoDefense framework, we compare single-agent setups with two- and three-agent configurations. Our results show that multi-agent systems enhance resistance to jailbreaks, especially by reducing false negatives. However, their effectiveness varies by attack type, and they introduce trade-offs such as increased false positives and computational overhead. These findings point to the limitations of current automated defences and suggest directions for improving alignment robustness in future LLM systems.
Problem

Research questions and friction points this paper is trying to address.

Defending against jailbreaking attacks on large language models
Evaluating multi-agent systems for enhanced resistance to jailbreaks
Analyzing trade-offs in false positives and computational overhead (error rates sketched below)
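
The security-usability trade-off named above reduces to two error rates: the false negative rate (jailbreaks that slip through) and the false positive rate (benign prompts wrongly refused). The following is a minimal sketch of how these rates are computed from labelled verdicts; the variable names and toy data are assumptions for illustration.

```python
# Hedged sketch: computing the two error rates the paper trades off.
# `results` pairs a ground-truth label with the defence's verdict;
# all names and values here are illustrative assumptions.
results = [
    # (is_malicious, was_blocked)
    (True, True), (True, False), (False, False), (False, True),
]

malicious = [r for r in results if r[0]]
benign = [r for r in results if not r[0]]

# False negative rate: malicious prompts the defence failed to block.
fnr = sum(1 for _, blocked in malicious if not blocked) / len(malicious)
# False positive rate: benign prompts the defence wrongly blocked.
fpr = sum(1 for _, blocked in benign if blocked) / len(benign)

print(f"FNR={fnr:.2f}  FPR={fpr:.2f}")  # FNR=0.50  FPR=0.50 on this toy data
```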
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent LLM systems defend against jailbreaking attacks
Compare single-agent with multi-agent configurations
Enhance resistance by reducing false negatives
Maria Carolina Cornelia Wit
Department of Computer Science, University of Luxembourg, Esch-sur-Alzette, Luxembourg
Jun Pang
University of Luxembourg
formal methods · graph machine learning · security and privacy · systems biology · complex networks