BlindGuard: Safeguarding LLM-based Multi-Agent Systems under Unknown Attacks

📅 2025-08-11
🤖 AI Summary
Problem: Malicious agents in LLM-based multi-agent systems (MAS) can distort collective decision-making by propagating adversarial messages. Existing supervised defenses depend on labeled malicious agents for training, which limits their practicality and their ability to generalize to attacks not seen during training. Method: BlindGuard is an unsupervised defense that learns only from normal agent behavior, with no attack-specific labels or prior knowledge of malicious behavior. A hierarchical agent encoder captures each agent's individual, neighborhood, and global interaction patterns, and a corruption-guided detector combines directional noise injection with contrastive learning to train the detection model. Contribution/Results: Across MAS with various communication patterns, BlindGuard detects prompt injection, memory poisoning, and tool attacks, and it generalizes better than supervised baselines.

📝 Abstract
The security of LLM-based multi-agent systems (MAS) is critically threatened by propagation vulnerability, where malicious agents can distort collective decision-making through inter-agent message interactions. While existing supervised defense methods demonstrate promising performance, they may be impractical in real-world scenarios due to their heavy reliance on labeled malicious agents to train a supervised malicious detection model. To enable practical and generalizable MAS defenses, in this paper, we propose BlindGuard, an unsupervised defense method that learns without requiring any attack-specific labels or prior knowledge of malicious behaviors. To this end, we establish a hierarchical agent encoder to capture individual, neighborhood, and global interaction patterns of each agent, providing a comprehensive understanding for malicious agent detection. Meanwhile, we design a corruption-guided detector that consists of directional noise injection and contrastive learning, allowing effective detection model training solely on normal agent behaviors. Extensive experiments show that BlindGuard effectively detects diverse attack types (i.e., prompt injection, memory poisoning, and tool attack) across MAS with various communication patterns while maintaining superior generalizability compared to supervised baselines. The code is available at: https://github.com/MR9812/BlindGuard.
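The three levels of context named in the abstract (individual, neighborhood, global) can be illustrated with a minimal sketch. This is not the BlindGuard encoder: it assumes each agent's messages have already been turned into a fixed-size embedding (the vectors below are made up), and it uses plain averaging where the paper presumably uses learned components. The names `hierarchical_features`, `embeddings`, and `neighbors` are hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def mean(vectors: List[Vector], dim: int) -> Vector:
    """Element-wise mean; returns a zero vector when the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def hierarchical_features(
    embeddings: Dict[str, Vector],
    neighbors: Dict[str, List[str]],
) -> Dict[str, Vector]:
    """Concatenate individual, neighborhood, and global views per agent.

    Assumption: averaging stands in for whatever aggregation the
    actual encoder learns.
    """
    dim = len(next(iter(embeddings.values())))
    global_view = mean(list(embeddings.values()), dim)
    features = {}
    for agent, own in embeddings.items():
        neighborhood = mean([embeddings[n] for n in neighbors.get(agent, [])], dim)
        features[agent] = own + neighborhood + global_view
    return features


# Toy communication graph: agent "a" talks to "b" and "c".
embeddings = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
features = hierarchical_features(embeddings, neighbors)
```

Each agent ends up with a vector three times the embedding size, so a downstream detector can compare an agent against its own history, its neighbors, and the system as a whole.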
Problem

Research questions and friction points this paper is trying to address.

Detects malicious agents in LLM-based multi-agent systems without labeled data
Addresses propagation vulnerability in collective decision-making via message interactions
Provides unsupervised defense against diverse attack types in MAS
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unsupervised defense method without attack labels
Hierarchical agent encoder captures interaction patterns
Corruption-guided detector with noise injection
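To give a feel for how a detector can be fit on normal behavior alone, the sketch below corrupts normal feature vectors to create pseudo-anomalies and uses them to calibrate a distance threshold. It is a deliberately simplified stand-in: the paper's detector is trained with contrastive learning, whereas this example swaps in a centroid-distance score, and the form of the noise (a fixed-magnitude step along a random unit direction) is an assumption, since the summary does not define "directional noise injection". All function names and values are hypothetical.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]


def centroid(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def distance(u: Vector, v: Vector) -> float:
    """Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def corrupt(v: Vector, scale: float, rng: random.Random) -> Vector:
    """Assumed corruption: step of length `scale` along a random unit direction."""
    direction = [rng.gauss(0.0, 1.0) for _ in v]
    norm = math.sqrt(sum(d * d for d in direction)) or 1.0
    return [x + scale * d / norm for x, d in zip(v, direction)]


def fit_threshold(
    normal: List[Vector], scale: float, rng: random.Random
) -> Tuple[Vector, float]:
    """Place a threshold midway between normal and corrupted scores."""
    center = centroid(normal)
    normal_scores = [distance(v, center) for v in normal]
    corrupted_scores = [distance(corrupt(v, scale, rng), center) for v in normal]
    threshold = (max(normal_scores) + min(corrupted_scores)) / 2
    return center, threshold


def is_anomalous(v: Vector, center: Vector, threshold: float) -> bool:
    return distance(v, center) > threshold


# Only normal agent features are available at training time.
normal = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 1.1]]
center, threshold = fit_threshold(normal, scale=5.0, rng=random.Random(0))

benign_flag = is_anomalous([1.05, 1.0], center, threshold)
attack_flag = is_anomalous([6.0, -3.0], center, threshold)
```

With tightly clustered normal features and a large corruption scale, the threshold falls well outside the normal cluster, so the nearby vector is accepted and the distant one is flagged. A smaller `scale` would make the pseudo-anomalies overlap the normal data and the midpoint rule would stop separating them.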
Authors

Rui Miao: Meta (Networking, Networked Systems, Distributed Systems)
Yixin Liu: School of Information and Communication Technology, Griffith University, Gold Coast, Australia
Yili Wang: Jilin University (Graph Neural Networks)
Xu Shen: School of Artificial Intelligence, Jilin University, Changchun, China
Yue Tan: University of New South Wales (Machine Learning, Federated Learning, Reinforcement Learning)
Yiwei Dai: School of Artificial Intelligence, Jilin University, Changchun, China
Shirui Pan: Professor, ARC Future Fellow, FQA, Director of TrustAGI Lab, Griffith University (Data Mining, Machine Learning, Graph Neural Networks, Trustworthy AI, Time Series)
Xin Wang: School of Artificial Intelligence, Jilin University, Changchun, China