Evaluating Multi-Agent Defences Against Jailbreaking Attacks on Large Language Models

📅 2025-06-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the insufficient robustness of large language models (LLMs) against jailbreaking attacks by proposing and systematically evaluating a multi-agent collaborative defence mechanism. The authors extend the AutoDefense framework with dual- and tri-agent architectures, integrating state-of-the-art jailbreaking strategies, including BetterDan and JB, for adversarial evaluation. Results demonstrate that multi-agent systems significantly reduce false negative rates (i.e., missed detections of malicious prompts), outperforming single-agent baselines in jailbreak resistance. However, this improvement incurs increased false positive rates and higher inference overhead, exposing an inherent trade-off among security, usability, and efficiency. To the authors' knowledge, this is the first study to quantitatively validate the efficacy of the multi-agent paradigm for LLM safety defence, precisely characterizing its performance boundaries across diverse attack types. The findings establish a scalable, principled pathway toward enhancing LLM robustness against adversarial prompt engineering.
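
In AutoDefense-style pipelines, the defence filters a candidate response through cooperating analysis agents before it is released to the user. Below is a minimal sketch of that tri-agent response-filtering pattern; the agent roles, prompts, and the `chat` helper are illustrative assumptions, not the framework's actual API.

```python
# Minimal sketch of a tri-agent response filter in the spirit of AutoDefense.
# The roles, prompts, and the `chat` callable are illustrative assumptions.
from typing import Callable

Chat = Callable[[str], str]  # wraps one LLM call: prompt -> completion

def intention_agent(chat: Chat, response: str) -> str:
    """Agent 1: summarise the apparent intention behind the response."""
    return chat(f"Describe the intention of this response:\n{response}")

def analysis_agent(chat: Chat, response: str, intention: str) -> str:
    """Agent 2: analyse whether the response fulfils a harmful request."""
    return chat(
        "Given the stated intention, does this response help with a "
        f"harmful or disallowed task?\nIntention: {intention}\n"
        f"Response: {response}"
    )

def judge_agent(chat: Chat, analysis: str) -> bool:
    """Agent 3: return True if the response should be blocked."""
    verdict = chat(f"Answer BLOCK or ALLOW.\nAnalysis: {analysis}")
    return "BLOCK" in verdict.upper()

def defend(chat: Chat, response: str) -> str:
    """Tri-agent pipeline: release the response only if the judge allows it."""
    intention = intention_agent(chat, response)
    analysis = analysis_agent(chat, response, intention)
    if judge_agent(chat, analysis):
        return "I'm sorry, but I can't help with that."
    return response
```

A dual-agent configuration would merge the intention and analysis steps into a single agent; the paper compares both against a single-agent baseline.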

📝 Abstract
Recent advances in large language models (LLMs) have raised concerns about jailbreaking attacks, i.e., prompts that bypass safety mechanisms. This paper investigates the use of multi-agent LLM systems as a defence against such attacks. We evaluate three jailbreaking strategies: the original AutoDefense attack and two from Deepleaps, BetterDan and JB. Reproducing the AutoDefense framework, we compare single-agent setups with two- and three-agent configurations. Our results show that multi-agent systems enhance resistance to jailbreaks, especially by reducing false negatives. However, their effectiveness varies by attack type, and they introduce trade-offs such as increased false positives and computational overhead. These findings point to the limitations of current automated defences and suggest directions for improving alignment robustness in future LLM systems.
Problem

Research questions and friction points this paper is trying to address.

Defending against jailbreaking attacks on large language models
Evaluating multi-agent systems for enhanced resistance to jailbreaks
Analyzing trade-offs in false positives and computational overhead (error rates sketched below)
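
The security-usability trade-off named above reduces to two error rates: the false negative rate (jailbreaks that slip through) and the false positive rate (benign prompts wrongly refused). The following is a minimal sketch of how these rates are computed from labelled verdicts; the variable names and toy data are assumptions for illustration.

```python
# Hedged sketch: computing the two error rates the paper trades off.
# `results` pairs a ground-truth label with the defence's verdict;
# all names and values here are illustrative assumptions.
results = [
    # (is_malicious, was_blocked)
    (True, True), (True, False), (False, False), (False, True),
]

malicious = [r for r in results if r[0]]
benign = [r for r in results if not r[0]]

# False negative rate: malicious prompts the defence failed to block.
fnr = sum(1 for _, blocked in malicious if not blocked) / len(malicious)
# False positive rate: benign prompts the defence wrongly blocked.
fpr = sum(1 for _, blocked in benign if blocked) / len(benign)

print(f"FNR={fnr:.2f}  FPR={fpr:.2f}")  # FNR=0.50  FPR=0.50 on this toy data
```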
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent LLM systems defend against jailbreaking attacks
Compare single-agent with multi-agent configurations
Enhance resistance by reducing false negatives
Maria Carolina Cornelia Wit
Department of Computer Science, University of Luxembourg, Esch-sur-Alzette, Luxembourg
Jun Pang
University of Luxembourg
formal methods · graph machine learning · security and privacy · systems biology · complex networks