Immunity memory-based jailbreak detection: multi-agent adaptive guard for large language models

📅 2025-12-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) are vulnerable to adversarial jailbreaking attacks, while existing detection methods rely on parameter fine-tuning—entailing high update costs and poor generalization. To address this, we propose an immune memory–inspired multi-agent adaptive defense framework that requires no model parameter tuning. Our approach leverages activation-value extraction, dynamic feature matching, memory bank comparison, and multi-agent collaborative supervision to rapidly identify and perform secondary filtering of unseen jailbreaking inputs. Its core innovation lies in the first application of biological immune memory mechanisms to LLM security, enabling a dynamically updatable defense system with strong generalization capability. Extensive experiments across five open-source LLMs demonstrate that our method achieves 98% detection accuracy and a 96% F1 score, significantly outperforming current state-of-the-art approaches.

📝 Abstract
Large language models (LLMs) have become foundational in AI systems, yet they remain vulnerable to adversarial jailbreak attacks. These attacks involve carefully crafted prompts that bypass safety guardrails and induce models to produce harmful content. Detecting such malicious input queries is therefore critical for maintaining LLM safety. Existing methods for jailbreak detection typically fine-tune LLMs into static safety LLMs using fixed training datasets. However, these methods incur substantial computational costs when updating model parameters to improve robustness, especially in the face of novel jailbreak attacks. Inspired by immunological memory mechanisms, we propose the Multi-Agent Adaptive Guard (MAAG) framework for jailbreak detection. The core idea is to equip the guard with memory capabilities: upon encountering a novel jailbreak attack, the system memorizes the attack pattern, enabling it to rapidly and accurately identify similar threats in future encounters. Specifically, MAAG first extracts activation values from input prompts and compares them to historical activations stored in a memory bank for quick preliminary detection. A defense agent then simulates responses based on these detection results, and an auxiliary agent supervises the simulation process to provide secondary filtering of the detection outcomes. Extensive experiments across five open-source models demonstrate that MAAG significantly outperforms state-of-the-art (SOTA) methods, achieving 98% detection accuracy and a 96% F1-score across a diverse range of attack scenarios.
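The memory-bank matching stage described above can be sketched as a cosine-similarity lookup over normalized activation vectors: flag a prompt if its activation is close to any memorized attack, and store confirmed attacks for future matching. This is a minimal illustrative sketch, not the paper's implementation; the class name, threshold value, and plain-list memory bank are all assumptions.

```python
import math

class MemoryBankGuard:
    """Toy sketch of immune-memory-style jailbreak detection:
    compare an input prompt's activation vector against a memory
    bank of past attack activations (cosine similarity).
    Names and the 0.85 threshold are illustrative assumptions."""

    def __init__(self, threshold=0.85):
        self.threshold = threshold
        self.bank = []  # normalized activations of memorized attacks

    @staticmethod
    def _normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    def is_jailbreak(self, activation):
        """Preliminary detection: does any memorized attack pattern
        match this activation above the similarity threshold?"""
        a = self._normalize(activation)
        return any(
            sum(x * y for x, y in zip(a, b)) >= self.threshold
            for b in self.bank
        )

    def memorize(self, activation):
        """Store a confirmed attack, mimicking immune memory formation."""
        self.bank.append(self._normalize(activation))
```

In the full framework this preliminary verdict would then be passed to the defense and auxiliary agents for secondary filtering; here only the memory-lookup step is shown.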
Problem

Research questions and friction points this paper is trying to address.

Detecting adversarial jailbreak attacks on large language models
Generalizing to novel threats via adaptive, memory-based pattern recognition
Avoiding the high computational cost of static fine-tuning-based defenses
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent adaptive guard with memory capabilities
Compares prompt activations to historical memory bank
Uses defense and auxiliary agents for simulation supervision
Jun Leng
Beijing University of Posts and Telecommunications
Litian Zhang
Beihang University
Xi Zhang
Beijing University of Posts and Telecommunications